This symposium consisted of five papers:

1.  James  R.  McBride:  A Brief  Overview  of  Adaptive  Testing 

Adaptive testing is defined, and some of its item selection and scoring
strategies briefly discussed. Item response theory, or item characteristic
curve theory, which is useful for the implementation of adaptive
testing is briefly described. The concept of "information" in a test
is introduced and discussed in the context of both adaptive and conventional
tests. The advantages of adaptive testing, in terms of the
nature of information it provides, are described.

2. James B. Sympson: Estimation of Latent Trait Status in Adaptive Testing Procedures

The role of latent trait theory in measurement for criterion prediction
and in criterion-referenced measurement is explicated. It is noted that
latent trait models allow both norm-referenced and criterion-referenced
interpretations of test performance. Using a 3-parameter logistic test
model, an example of sequential estimation in a 20-item adaptive test is
presented. After each item is administered, four different ability estimates
(two likelihood-based and two Bayesian estimates) are calculated.
Characteristics of the four estimation methods are discussed. The information
available in the items selected by the adaptive test is compared
with the information available from comparable "rectangular" and "peaked"
non-adaptive tests. The joint application of latent trait theory and
adaptive testing is advocated as a useful approach to human assessment.


3.  C.  David  Vale:  Adaptive  Testing  and  the  Problem  of  Classification 

The use of adaptive testing procedures to make ability classification
decisions (i.e., cutting score decisions) is discussed. Data from computer
simulations comparing conventional testing strategies with an
adaptive testing strategy are presented. These data suggest that,
although a conventional test is as good as an adaptive test when there is
one cutting score at the middle of the distribution of ability, an adaptive
test can provide better classification decisions when there is more
than one cutting score. Some utility considerations are also discussed.


4. Steven M. Pine: Applications of Item Characteristic Curve Theory to the Problem of Test Bias

It is argued that a major problem in current efforts to develop less
biased tests is an over-reliance on classical test theory. Item Characteristic
Curve (ICC) Theory, which is based on individual rather than
group-oriented measurement, is offered as a more appropriate measurement
model. A definition of test bias based on ICC theory is presented. Using
this definition, several empirical tests for bias are presented and demonstrated
with real test data. Additional applications of ICC theory to
the problem of test bias are also discussed.


5. Isaac I. Bejar: Applications of Adaptive Testing in Measuring Achievement and Performance

The  paper  reviews  two  relatively  recent  developments  in  psychometric 
theory,  the  assessment  of  partial  knowledge  and  research  in  adaptive 
testing.  It  is  argued  that  the  use  of  non-dichotomous  item  formats, 
needed  for  the  assessment  of  partial  knowledge,  and  now  made  possible  by 
the  administration  of  achievement  test  items  on  interactive  computers, 
should  result  in  achievement  test  scores  which  are  a more  realistic  and 
precise  indication  of  what  a student  can  do. 

Applications of Computerized Adaptive Testing


A BRIEF OVERVIEW OF ADAPTIVE TESTING

JAMES  R.  McBRIDE 

U.  S.  Army  Research  Institute  for  the  Behavioral  and  Social  Sciences 


This symposium will present some recent developments in adaptive testing which
have applications to several military testing problems. The purpose of this overview
is to provide a brief introduction to adaptive testing: what it is, what is
needed to implement it, and why it is of interest.

"Adaptive"  testing  is  one  of  a number  of  terms  used  to  describe  a procedure 
whereby  the  test  items  that  comprise  an  individual's  test  are  selected  during 
the  test  itself.  Some  of  the  other  terms  used  interchangeably  with  adaptive  testing 
include  tailored  testing,  branched  testing,  programmed  testing,  and  individualized 
testing.  The  term  "adaptive"  was  chosen  because  these  tests  adapt  themselves  to 
the  examinee;  different  persons  answer  different  items,  with  the  items  chosen 
sequentially  to  suit  the  individual  examinee's  performance. 

Differential  selection  of  test  items  may  be  accomplished  in  any  number  of 
ways.  But,  generally,  in  adaptive  tests  a more  difficult  item  is  administered 
following  each  correct  answer,  and  an  easier  item  following  an  incorrect  one.  Some 
methods of adaptive testing have been implemented in paper-and-pencil mode; for
example, Lord's (1971) flexilevel adaptive test was designed specifically for
paper-and-pencil administration. However, experience has shown that the instructions
for paper-and-pencil adaptive tests are too complex for some examinees to
follow successfully (Weiss & Betz, 1973, p. 23). A more satisfactory mode of administration
is through use of an interactive computer terminal or similar device.

Thus,  Weiss  (1976)  chose  to  administer  adaptive  tests  at  a cathode-ray  terminal 
(CRT);  Bayroff,  Ross  and  Fischl  (1974)  reported  the  Army's  development  of  a 
computer-controlled  slide  projection  terminal  for  adaptive  testing;  Waters  (1977) 
designed  and  built  a micro-processor  terminal  which  directs  the  examinee  through 
an  adaptive  sequence  of  test  items  read  from  a printed  booklet. 

Item  selection  strategies.  Because  adaptive  tests  are  quite  different  from 
conventional  tests  in  which  all  examinees  must  answer  the  same  set  of  test  items, 
adaptive  testing  poses  some  new  psychometric  problems.  One  problem  is  how  to 
choose  successive  items  from  the  pool  of  available  items.  This  problem  can  be 
solved  through  an  item  selection  strategy,  which  defines  a formalized  rule  for 
item  choice. 

Numerous  item  selection  strategies  are  possible.  They  vary  from  very  simple 
two-branch  rules  to  rules  based  on  the  optimization  of  rather  complex  mathematical 
functions  (Weiss,  1974).  Obviously,  computerizing  the  item-selection  process 
facilitates  the  use  of  the  mathematical  optimization  procedures. 
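
As an illustration of the simplest kind of rule, the following Python sketch implements a two-branch ("up-and-down") strategy. It is only a sketch under stated assumptions, not a procedure taken from any of the studies cited above: it assumes a pool with known difficulty values, begins with the item of median difficulty, and never repeats an item. The names `two_branch_test` and `answers_correctly` are hypothetical.

```python
def two_branch_test(difficulties, answers_correctly, length):
    """Administer up to `length` items under a simple up-and-down rule.

    difficulties      -- list of item difficulty values (assumed known)
    answers_correctly -- hypothetical callable: item index -> True/False
    Returns the indices of the items administered, in order.
    """
    # Unused item indices, ordered from easiest to hardest.
    remaining = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    # Assumption: start with the item of median difficulty.
    current = remaining.pop(len(remaining) // 2)
    administered = []
    for _ in range(length):
        administered.append(current)
        correct = answers_correctly(current)
        if not remaining:
            break  # pool exhausted
        level = difficulties[current]
        if correct:
            # Branch up: nearest unused harder item, if there is one.
            harder = [i for i in remaining if difficulties[i] > level]
            current = harder[0] if harder else remaining[-1]
        else:
            # Branch down: nearest unused easier item, if there is one.
            easier = [i for i in remaining if difficulties[i] < level]
            current = easier[-1] if easier else remaining[0]
        remaining.remove(current)
    return administered


if __name__ == "__main__":
    pool = [-2.0, -1.0, 0.0, 1.0, 2.0]
    # Hypothetical examinee who passes every item easier than 0.5.
    print(two_branch_test(pool, lambda i: pool[i] <= 0.5, 4))
```

Rules based on mathematical optimization replace the fixed up or down step with a search of the remaining pool for the item that best suits the current ability estimate.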


Scoring  adaptive  tests.  Since  different  examinees  take  sets  of  test  items 
which  may  differ  in  number,  difficulty,  and  discriminating  power,  the  traditional 
number  correct  score  will  not  suffice  to  order  people  on  most  adaptive  tests.  Some 
scoring  procedure  is  required  which  will  consider  not  only  how  many  items  were 
answered  correctly,  but  also  which  items  were  taken,  and  the  pattern  of  right  and 
wrong answers to those items. The scoring procedures most widely used in adaptive
testing are based on various formulations of latent trait theory (e.g., Birnbaum,
1968; Lord, 1952, 1974; Rasch, 1960). All of these formulations provide statistical
methods for locating examinees on a common scale, even though they responded
to different sets of test items.

Item  response  theory.  Because  of  the  unique  characteristics  of  adaptive 
tests — tailoring  each  test  to  the  individual  and  locating  all  examinees  on  a common 
scale  despite  the  different  items  constituting  each  test — traditional  test  theory 
is  inadequate  for  use  in  adaptive  testing.  "Latent  trait"  or  "item  response" 
theory (Lord, 1952, 1976) provides an adequate theoretical basis for the development
of adaptive testing.

Item  response  theory,  also  known  as  item  characteristic  curve  theory,  is  a 
general  term  for  theoretical  formulations  which  account  for  examinees'  responses 
to  test  items  in  terms  of  their  status  on  an  underlying  attribute.  In  ability 
(or  achievement)  testing,  the  higher  the  attribute  status,  the  larger  is  the 
probability  of  a correct  response  to  any  given  item  which  measures  the  trait  in 
question. Through appropriate scaling procedures, a response curve can be constructed
for every such test item. This item characteristic curve (ICC) expresses
the  probability  of  a correct  response  as  a mathematical  function  of  the  scaled 
trait  and  the  item  characteristics. 

Every  person  can  be  characterized  by  his/her  location  on  this  scale.  Every 
test  item  also  has  a location  parameter  (its  threshold,  or  "difficulty")  and 
perhaps its own rate parameter (proportional to the steepness of the ICC), analogous
to  its  discriminating  power.  Some  items  also  have  a lower  asymptote,  or  guessing 
parameter. 

Knowing  which  items  a person  has  answered;  the  difficulty,  discrimination, 
and  guessing  parameters  of  those  items;  and  whether  the  answers  were  correct  or 
incorrect  permits  the  use  of  the  statistical  techniques  of  item  response  theory 
to  estimate  the  examinee's  ability.  The  resulting  ability  estimate  is  a "test 
score"  of  sorts  which  has  an  error  component  like  any  other  observed  score.  Unlike 
classical  test  theory,  item  response  theory  makes  no  assumption  that  measurement 
errors  are  independent  of  "true  score",  which  is  appropriate  because  this  central 
assumption  of  classical  test  theory  is  untenable  (Lumsden,  1976).  Whether  ability 
is  defined  as  "true  score"  or  as  location  on  a latent  continuum,  errors  of  measurement 
can  vary  at  different  levels  of  the  trait,  reflecting  in  part  the  discrepancy 
between  examinee  trait  level  and  the  difficulties  of  the  test  items. 
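
The estimation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than any of the estimation procedures used in the cited work: it assumes a three-parameter logistic response function with the conventional scaling constant 1.7, and it locates the maximum of the likelihood by a plain grid search. The function names and the grid limits are hypothetical choices.

```python
import math

D = 1.7  # assumed scaling constant for the logistic response function


def p_correct(theta, a, b, c):
    """Probability of a correct answer: a = discrimination, b = difficulty, c = guessing."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a right/wrong pattern at a given ability level."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if u else math.log(1.0 - p)
    return total


def grid_ml_estimate(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Ability value on an evenly spaced grid that maximizes the likelihood."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


if __name__ == "__main__":
    items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    print(grid_ml_estimate(items, [1, 1, 0]))
    print(grid_ml_estimate(items, [1, 0, 0]))
```

Note that a pattern of all correct (or all incorrect) answers has no finite maximum, so the sketch simply returns the edge of the grid in that case.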

Information. Item response theory permits the evaluation of something closely
akin to the standard error of measurement as a function of underlying ability, if
the test item parameters are known. This is called the test information function
(Birnbaum, 1968), which is inversely related to the squared standard error of estimating
an examinee's location on the trait scale. If the information function of a
typical peaked conventional test (one whose items are all about equal in difficulty)
were plotted, it would likewise be peaked: very high
over a narrow range of the trait, but diminishing in magnitude elsewhere. Such a
test will discriminate very well over a narrow interval of the trait range; it will
not discriminate as well outside that interval. The ability level at which the
test information function is highest can be referred to as the test "center".

The  information  function  of  a "rectangular"  conventional  test  (one  whose 
item  difficulties  are  uniformly  distributed  over  a wide  range)  is  fairly  flat,  but 
low  over  a broad  interval  on  the  trait  scale  around  the  test  center.  This  test 
would  measure  about  equally  well  over  a much  wider  range  than  the  peaked  test, 
but  other  things  being  equal,  would  not  discriminate  nearly  as  effectively  as 
does  the  peaked  test  at  its  center. 

The  design  of  conventional  tests.  A test  measures  best  (most  precisely)  where 
its information function is highest (and hence its standard error is lowest).
It  is  frequently  desirable  to  have  high  measurement  precision  over  most  of  the 
normal  range  of  the  attribute  we  seek  to  measure.  This  is  tantamount  to  a high, 
flat  information  function.  Conventional  testing,  however,  presents  a dilemma.  A 
peaked  test  can  be  constructed  which  yields  an  information  function  with  a high 
peak;  or  at  the  other  extreme,  a rectangular  test  can  be  built  which  has  a low, 
flat  information  function.  A test  with  a high,  flat  information  function  cannot 
be  constructed  for  conventional  test  administration  unless  it  is  extremely  long. 

This  problem  can  be  referred  to  as  a "bandwidth-fidelity  dilemma",  with 
apologies to Cronbach (1961), who described a different "bandwidth-fidelity
dilemma".  The  designer  of  a conventional  test  can  construct  it  to  have  high 
"fidelity" — high  precision,  low  measurement  error — over  a narrow  range  of  ability; 
or  to  have  a broad  "bandwidth" — equiprecision  of  measurement  over  a wide  range 
of  ability,  at  the  expense  of  fidelity.  In  designing  a conventional  test,  there 
is  a tradeoff  between  broad  bandwidth  and  high  fidelity;  the  designer  cannot  have 
both. 

Adaptive  testing.  Herein  resides  the  most  attractive  feature  of  adaptive 
tests from a psychometric point of view: because the test is adapted to the
individual, the discrepancy between trait level and item difficulty can be made
both  small  and  fairly  constant  across  the  trait  range.  The  result  is  a flat 
information  function  which  is  also  generally  high.  Adaptive  tests — and  only 
adaptive  tests — are  capable  of  accurate,  equiprecise  measurement  over  a wide 
ability  range.  This  should  pay  dividends  in  test  reliability,  criterion-related 
validity,  and  in  the  general  utility  of  the  test  for  a broad  range  of  measurement 
and  decision  applications. 

A properly designed adaptive test will have higher reliability than a conventional
test of the same length. As a corollary to that, an adaptive test can
achieve  a specified  level  of  reliability  in  substantially  fewer  items  than  can  a 
conventional  test,  thus  permitting  the  measurement  of  additional  attributes  in 
the  time  saved.  Both  improved  reliability  and  additional  measurements  should  result 
in  an  increment  in  predictive  validity  over  that  obtained  using  conventional  tests. 

In addition to the psychometric benefits accruing from the use of adaptive
tests, there are psychological benefits to the examinees. Adaptive tests can have
positive effects on the test-taking motivation of examinees (Betz & Weiss, 1976b)
and, for some testees, on their measured ability levels (Betz & Weiss, 1976a).
By tailoring test difficulty to examinee ability, adaptive tests can reduce the
effects of guessing among low-ability examinees and make any remaining effects
relatively constant across ability levels.

Summary 

This  overview  has  presented  a rather  broad-brush  introduction  to  adaptive 
testing.  Hopefully,  it  has  conveyed  some  conception  of  what  adaptive  testing 
is,  of  the  rudiments  of  the  test  theory  supporting  it,  and  of  the  significant 
psychometric  and  psychological  advantages  that  can  accrue  when  a well-designed 
adaptive  testing  program  is  implemented  in  a mental-measurement  setting.  The 
four  principal  papers  in  this  symposium  will  deal  in  more  detail  with  some  methods 
used in conjunction with adaptive testing, and with a variety of areas of application
of adaptive tests which are relevant to the needs and problems of test
users  in  the  military. 


ESTIMATION  OF  LATENT  TRAIT  STATUS  IN  ADAPTIVE  TESTING  PROCEDURES 

JAMES  B.  SYMPSON 

University  of  Minnesota 


During  the  last  few  years,  latent  trait  theory  has  become  increasingly 
important  as  a theoretical  foundation  for  the  practice  of  psychological  and 
educational  assessment.  This  has  been  due  to  shortcomings  inherent  in  classical 
test  theory  (Lumsden,  1976)  and  to  recent  developments  in  testing  practice.  In 
particular,  when  "adaptive"  or  "individualized"  testing  is  desired,  latent  trait 
theory  provides  a particularly  useful  conceptual  scheme  for  guiding  test  design  and 
test  scoring  procedures. 

Latent  trait  theories  are  characterized  by  a mathematical  model  that  relates 
the  probability  of  occurrence  of  a particular  response  class  (e.g.,  a "correct" 
response)  in  the  presence  of  a particular  stimulus  (e.g.,  a test  item)  to  a person's 
position  on  one  or  more  metric  dimensions.  The  graph  of  the  function  that  relates 
probability  of  a particular  response  class  to  a person's  status  on  these  dimensions 
can be referred to as a response-characteristic surface.

Both  univariate  and  multivariate  latent  trait  models  have  been  proposed.  The 
univariate  models  (e.g.,  Birnbaum,  1968;  Bock,  1972;  Lord,  1952;  Rasch,  1960) 
assume  that  response  probabilities  are  related  to  the  relative  positions  of  persons 
and stimuli on a single metric dimension. Multivariate models (e.g., Christoffersson,
1975; Samejima, 1974) allow for the possibility of several latent dimensions.

Latent  Trait  Theory  and  the  Objectives  of  Measurement 

When  they  first  encounter  latent  trait  theory,  many  people  question  its 
practical  utility.  For  example,  they  often  ask,  "Why  should  I bother  with  an 
approach  to  testing  that  involves  inferred  latent  traits  if  what  I'm  really 
interested  in  is  either  predicting  some  criterion  accurately  or  achieving  content 
validity  and  implementing  criterion-referenced  measurement?"  In  order  to  motivate 
an  interest  in  latent  trait  estimation  procedures,  it  will  be  useful  to  discuss 
briefly  the  issues  raised  by  this  type  of  question. 

The  "existence"  of  latent  traits.  The  adoption  of  latent  trait  theory  as  a 
guide to test construction and test scoring does not require a belief in the
"existence" of unobservable traits that control human behavior. Empirically, it is
sufficient to inquire whether people's responses to test stimuli can be predicted
accurately  on  the  basis  of  such  a model.  The  postulated  dimensions  of  latent  trait 
theory  can  be  viewed  as  quantitative  variables  that  are  created  by  calibrating  and 
scoring  test  items  in  a certain  way.  These  variables  can  provide  a convenient  basis 
for  designing  testing  procedures  and  may  lead  to  increased  predictive  accuracy  in 
scientific  and  practical  applications. 


This research is supported by contract N00014-76-C-0243, NR150-382, with the
Personnel  and  Training  Research  Programs,  Office  of  Naval  Research. 

Measurement for criterion prediction. In many situations, tests are developed
and  applied  with  the  sole  intention  of  predicting  performance  on  a criterion  of 
interest.  The  introduction  of  intervening  variables  (latent  traits)  might  seem 
unnecessary  when  one  is  only  interested  in  obtaining  a high  degree  of  relationship 
between  test  scores  and  criterion  scores.  However,  estimates  of  latent  trait  status 
can  themselves  be  viewed  as  a particular  variety  of  test  score.  Such  scores  may  or 
may  not  have  higher  predictive  validity  than  more  conventional  test  scores;  this 
is an empirical question. But, even if predictive validity is not increased via the
use  of  latent  trait  scores,  it  may  still  be  advantageous  to  adopt  a latent  trait 
approach  if  the  testing  process  can  be  made  more  efficient  as  a result  (e.g.,  through 
adaptive testing procedures).

Moreover,  test  development  for  the  purpose  of  criterion  prediction  is  always 
based  upon  an  implicit  structural  model.  No  one  chooses  items  at  random  from  all 
conceivable  item  domains.  Test  developers  try  out  items  with  certain  kinds  of 
content  and  never  consider  using  other  kinds  of  content.  They  also  attempt  to 
generate  items  that  have  difficulty  levels  or  endorsement  rates  (i.e.,  p-values) 
that are not too extreme in the population to be tested. This is done so that
item-criterion correlations will not be unduly restricted. Such procedures suggest the
existence  of  an  implicit  structural  model. 

Trying  certain  types  of  items,  and  not  others,  implies  that  certain  types  of 
inter-person  differences  exist  and  are  related  to  criterion  performance,  while 
others  are  not.  More  generally,  any  conceptual  scheme  for  classifying  test  items 
implies  a corresponding  set  of  response  variables  that  can  be  generated  when  the 
items  are  administered.  In  selecting  items  for  criterion  prediction  the  test 
developer  indicates  the  response  variables  that  are  thought  to  be  related  to  the 
criterion.

A concern  about  item  difficulties  and  endorsement  rates  implies  that  the 
probability  of  a given  response  to  an  item  is  a function  of  status  on  the  relevant 
response  variable(s).  If  such  probabilities  were  not  a function  of  status  on  the 
response variables, an item would have the same p-value in every conceivable population
and there would be no need to match item difficulties to the population that
is  to  be  tested. 


A latent  trait  approach  to  test  construction  and  scoring  provides  a formal 
vehicle  for  elaborating  structural  models  and  encourages  the  test  developer  to  make 
structural  assumptions  explicit.  When  structural  models  are  explicitly  stated, 
they  can  serve  to  guide  test  construction  efforts  and  aid  in  the  interpretation  of 
empirical  results. 

Content validity and criterion-referenced measurement. The testing situation
never constitutes the entire behavioral domain of interest. The implicit objective
in pursuing content validity and in implementing criterion-referenced measurement
is to make more accurate inferences about a person's potential for performance in a
hypothetical task domain (Cronbach, 1971, p. 452; Glaser & Nitko, 1971, p. 653).
This hypothetical task domain, though it is not observable in its entirety, is
carefully defined in terms of performance objectives or item content. Test items
are generated that represent the domain, and responses to these items are used as a
basis for making inferences about domain performance.

Some individuals protest such a view and argue that in criterion-referenced
measurement  the  test  stimuli  are  the  criterion  tasks  of  interest  and  that  no 
further  task  domain  is  intended  or  implied.  However,  unless  all  the  tasks  that  are 
required  on  the  job  are  included  in  the  test,  inferences  are  necessarily  being  made 
about  a larger  task  domain  from  a sample  of  person-stimulus  interactions  drawn  from 
the  domain. 

What is the nature of the hypothetical task domain in achievement testing?
Such task domains can be described in terms of a multidimensional structural model.
Whenever  test  stimuli  can  be  clustered  with  regard  to  common  content  or  process 
and  arranged  in  a learning  hierarchy  within  each  cluster,  there  is  a definite 
possibility  that  a latent  trait  approach  to  achievement  testing  will  be  useful. 

Norm-referenced  and  criterion-referenced  interpretations  of  test  performance. 
In  recent  years,  the  distinction  between  norm-referenced  and  criterion-referenced 
measurement  has  been  widely  discussed.  An  important  fact  to  keep  in  mind  is  that 
this  distinction  properly  applies  to  the  type  of  information  available  from  test 
scores,  not  to  test  content  or  the  testing  procedure  itself  (Hambleton  & Novick, 
1973, p. 162). This is important because estimates of latent trait status can
provide information about both inter-person differences (norm-referenced interpretations)
and intra-person response probabilities (criterion-referenced interpretations)
for tasks drawn from a task domain.

An  estimate  of  an  individual's  latent  trait  status  can  be  converted  to  a 
centile  rank  or  standard  score  relative  to  any  norm  group  previously  tested  using 
the  latent  trait  procedure.  This  same  latent  trait  estimate,  when  considered  in 
conjunction  with  the  latent  trait  parameters  of  a test  item  (i.e.,  a task  sample) 
that  has  been  previously  calibrated,  allows  generation  of  the  probability  of 
occurrence  of  a given  response  class  (e.g.,  a "correct"  response)  in  the  presence 
of  the  item.  (That  is,  one  can  determine  the  probability  that  a person  will 
complete  a given  task  successfully,  even  though  the  person  has  never  attempted  the 
task.)  The  fact  that  latent  trait  theory  can  provide  both  norm-referenced  and 
criterion-referenced  interpretations  of  test  performance  indicates  that  the  current 
schism  between  psychological  and  educational  testing  may  be  narrowed  considerably 
in  the  years  to  come. 


Estimating  Latent  Trait  Status 

In  order  to  exploit  the  wide  range  of  potential  applications  of  latent  trait 
theory,  it  is  necessary  to  understand  procedures  for  estimating  latent  trait  status 
of  individual  testees.  Four  methods  for  obtaining  estimates  of  latent  trait  status 
are described below. In addition, it will be shown that the accuracy of such estimates
can often be improved through the use of adaptive testing procedures.

The latent trait model to be described is one in which only two response classes
are considered, a keyed response and a non-keyed response, and the probability of
occurrence of each response class is a function of a single latent dimension. This
model might be applicable to a test that has been constructed to maximize internal
consistency (Nunnally, 1967, pp. 254-268) and in which items are scored dichotomously.
The  model  would  not  be  suitable  for  tests  that  involve  a multidimensional  item 
structure,  but  the  principles  of  latent  trait  estimation  that  are  discussed  can 
be  generalized  to  such  cases. 




The  Three-parameter  Logistic  Model 

This latent trait model has been investigated extensively by Birnbaum (1968).
The function rule that relates probability of a keyed response to the parameters
of the model is given in Equation 1.

$$P_g(\theta) = c_g + (1 - c_g)\,[1 + \exp(-1.7\,a_g(\theta - b_g))]^{-1} \qquad [1]$$

The quantity $P_g(\theta)$ is the probability of a keyed response to item $g$, with
parameters $a_g$, $b_g$, and $c_g$, by a person whose location on the latent trait
continuum is given by the quantity $\theta$ (theta). The exponential operator (exp)
indicates that the quantity in parentheses is an exponent of the constant $e = 2.71828$.

Figure 1 shows a graph of the function $P_g(\theta)$ in the interval from $\theta = -3.00$ to
$\theta = +3.00$ for an item having $a_g = 2.0$, $b_g = 0.0$, and $c_g = .00$. This graph was generated
by evaluating $P_g(\theta)$ at 61 points along the theta continuum. The irregularities
visible in Figure 1 result from rounding $P_g(\theta)$ to the nearest .02 for plotting
purposes.


Figure 1

Response Characteristic Curve (a=2.0, b=0.0, c=.00)
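
The curve in Figure 1 can be regenerated approximately with the short Python sketch below. It assumes that the scaling constant in the exponent of Equation 1 is 1.7, a value consistent with the figure of .8455 quoted in the discussion of the discrimination parameter; the function name `p_keyed` and the exact rounding step are illustrative choices, not the author's plotting program.

```python
import math

D = 1.7  # assumed scaling constant in the exponent of Equation 1


def p_keyed(theta, a, b, c):
    """Three-parameter logistic probability of a keyed response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# 61 evenly spaced points from theta = -3.00 to theta = +3.00.
thetas = [-3.0 + 0.1 * k for k in range(61)]

# Item of Figure 1 (a = 2.0, b = 0.0, c = .00), rounded to the nearest .02.
curve = [round(p_keyed(t, 2.0, 0.0, 0.0) / 0.02) * 0.02 for t in thetas]

if __name__ == "__main__":
    for t, p in zip(thetas, curve):
        print(f"{t:+.2f}  {p:.2f}")
```

Changing the three item parameters in the call to `p_keyed` gives the corresponding values for other items.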


The item parameter $c_g$ is the value of $P_g(\theta)$ when $\theta = -\infty$. It is the lower
asymptote of $P_g(\theta)$ and is usually conceived of as the probability of a keyed
response occurring "by chance" when $\theta = -\infty$. The item parameter $b_g$ is known as the
item location parameter; it indicates the location on the latent trait continuum
at which $P_g(\theta)$ is equal to $.5(1 + c_g)$. The item parameter $a_g$ is known as the item
discrimination parameter. It is related to the slope of the response characteristic
curve and in this model is equal to the reciprocal of the distance that
one must move along the theta continuum in order to increase $P_g(\theta)$ from $.5(1 + c_g)$
to approximately $.8455(1 - c_g) + c_g$. Since $a_g = 2.0$ and $c_g = .00$ in Figure 1, the
distance between the locations on the theta continuum at which $P_g(\theta) = .50$ and
$P_g(\theta) = .84$ is equal to $1/a_g = .50$ theta units.
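
The value .8455 can be checked directly, assuming that the scaling constant in the exponent of Equation 1 is 1.7. Evaluating the model one distance $1/a_g$ above the item location gives:

```latex
\begin{aligned}
P_g(b_g + 1/a_g) &= c_g + (1 - c_g)\,[1 + \exp(-1.7)]^{-1} \\
                 &\approx c_g + (1 - c_g)\,[1 + 0.1827]^{-1} \\
                 &\approx c_g + .8455\,(1 - c_g)
\end{aligned}
```

At $\theta = b_g$ the exponent is zero, so the same expression reduces to $c_g + .5(1 - c_g) = .5(1 + c_g)$, in agreement with the definition of the location parameter.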

Figure 2 shows a response characteristic curve for an item having $a_g = 1.0$,
$b_g = 0.0$, and $c_g = .00$. The reduced value of $a_g$, relative to Figure 1, is reflected
in the shallower slope of this graph and in the fact that the distance between
the locations at which $P_g(\theta) = .50$ and $P_g(\theta) = .84$ is now equal to $1/a_g = 1.00$ theta
units. A value of $a_g$ in the vicinity of 1.0 is typical of many test items.
Values of $a_g$ below about .5 are indicative of "poor" items and values of $a_g$
above 2.0, while desirable in many applications, are not common.

Figure 2

Response Characteristic Curve (a=1.0, b=0.0, c=.00)

Figure 3 shows a response characteristic curve for an item having $a_g = 1.0$,
$b_g = 0.0$, and $c_g = .20$. The value $c_g = .20$ might be applicable to a multiple-choice
test item that has five response alternatives. In accord with the definitions
given above, $b_g$ is equal to the location at which $P_g(\theta) = .5(1 + .2) = .60$ and $a_g$ is
equal to the reciprocal of the distance from the location at which $P_g(\theta) = .60$ to
the location at which $P_g(\theta) = .8455(1 - .2) + .2 = .88$. Note that one of the effects
of a non-zero $c_g$ is to reduce the slope of $P_g(\theta)$ at all points along the theta
continuum.


Figure 3

Response Characteristic Curve (a=1.0, b=0.0, c=.20)


The  Concept  of  "Information" 

Birnbaum (1968) has discussed the concept of "information" available in a
test item. Birnbaum's item information function is given in Equation 2.

$$I(\theta, u_g) = [P'_g(\theta)]^2 / [P_g(\theta)\,Q_g(\theta)] \qquad [2]$$

In this equation, $u_g$ is the item response variable. It is equal to 1 when a
keyed response is emitted and is equal to 0 otherwise. The quantity $Q_g(\theta)$ is
equal to $1 - P_g(\theta)$. The numerator of Equation 2 is the squared first derivative
(i.e., the squared slope) of $P_g(\theta)$ at a fixed value of $\theta$. The denominator is
the variance of the item response variable, $u_g$, at a fixed value of $\theta$. The
quantity $I(\theta, u_g)$ is an index of the item's ability to discriminate people whose
latent trait location equals $\theta$ from people at nearby latent trait locations.


In general, a steeper slope for P_g(θ) implies greater discriminating power.
As was noted earlier, high values of a_g and low values of c_g increase the slope
of P_g(θ) and, hence, the information available from an item. The variance of
u_g approaches zero at latent trait levels that are far from b_g and reaches
its maximum value at the latent trait level where P_g(θ)=.5. Figure 4 shows a
graph of the function I(θ,u_g) in the interval from θ=-3.00 to +3.00 for the item
shown in Figure 2, which has a_g=1.0, b_g=0.0, and c_g=.00. This graph was generated
by evaluating I(θ,u_g) at 61 points along the theta continuum and rounding the
obtained values to the nearest .02.


Figure  4 

Information  Curve  for  a Single  Item  (a=1.0,  b= 0.0,  c=.00) 
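
The construction of a curve such as the one in Figure 4 can be sketched as
follows. This is an illustration only: it assumes that Equation 1 is the
three-parameter logistic model with scaling constant 1.7, and it approximates
the slope in Equation 2 with a central difference. The names `p3pl` and
`item_information` are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant of the logistic model


def p3pl(theta, a, b, c):
    """Assumed three-parameter logistic response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c, h=1e-5):
    """Equation 2: squared slope of P divided by the response variance P * Q."""
    p = p3pl(theta, a, b, c)
    # Central-difference approximation to the first derivative of P at theta.
    slope = (p3pl(theta + h, a, b, c) - p3pl(theta - h, a, b, c)) / (2.0 * h)
    return slope ** 2 / (p * (1.0 - p))


# 61 equally spaced theta values from -3.00 to +3.00, as described for Figure 4
thetas = [round(-3.0 + 0.1 * i, 2) for i in range(61)]
info = [item_information(t, 1.0, 0.0, 0.0) for t in thetas]
peak_value = max(info)
peak_theta = thetas[info.index(peak_value)]
print(peak_theta, round(peak_value, 2))
```

For the item with a=1.0, b=0.0, and c=.00, the peak found this way should lie at
the item's difficulty and have a height close to the value of about .72 read
from Figure 4.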


Figure 4 shows that an item provides maximum information in the region of
the theta continuum where the item is located (i.e., near b_g) and relatively
little information at levels far below or far above b_g. This result is consistent
with intuitive impressions of item discriminating power. If, for example,
an ability test item that was suitable for third graders (i.e., P_g(θ) near .5
among third graders) were administered to college students (in which group
P_g(θ)=1.0), all the college students would probably answer it correctly and
no basis for discriminating among college students would exist. Note that in
Figure 4 the information curve is symmetric about b_g and attains a maximum
value of approximately .72.

Figure 5 shows an information curve for an item having a_g=.85, b_g=0.0, and
c_g=.00. This curve, while still symmetric about b_g, attains a lower maximum
(approximately .52) and falls off more gradually on either side of b_g than the
curve in Figure 4. In fact, the item represented in Figure 5 provides slightly
more information than the item represented in Figure 4 in the interval below
θ=-1.40 and in the interval above θ=1.40. However, the gain in these regions
is slight compared to the information loss in the interval -1.40 ≤ θ ≤ 1.40.


Figure 6 shows an information curve for an item having a_g=1.0, b_g=0.0,
and c_g=.20. This curve is not symmetric about b_g. It attains its maximum
value of about .50 near θ=.16. The curve falls off more rapidly on the left
of θ=.16 than on the right. This reflects the fact that "chance" keyed responses
are more prevalent among people located below b_g than among people located
above b_g. Such "lucky" responses contribute error to the estimation of latent
trait status and reduce the amount of information available. Note that the
information curve in Figure 6 is lower than the curve in Figure 5. Introducing
the possibility of "lucky" keyed responses reduces the information available
from an item just as if it were an item with lower a_g, but with c_g=.00.
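
The effect of a non-zero c_g on item information can also be seen algebraically.
The following worked derivation is a sketch that assumes Equation 1 is the
three-parameter logistic model with scaling constant D=1.7; the symbols D and
\psi_g are introduced here for illustration and are not taken from the original.

```latex
% Assumed form of Equation 1 (three-parameter logistic, D = 1.7):
P_g(\theta) = c_g + (1 - c_g)\,\psi_g(\theta), \qquad
\psi_g(\theta) = \frac{1}{1 + \exp[-D a_g(\theta - b_g)]}

% Slope of the response characteristic curve:
P'_g(\theta) = D a_g (1 - c_g)\,\psi_g(\theta)\,[1 - \psi_g(\theta)]

% Substituting into Equation 2, with Q_g(\theta) = (1 - c_g)[1 - \psi_g(\theta)]:
I(\theta, u_g) = D^2 a_g^2\,\psi_g(\theta)\,[1 - \psi_g(\theta)]
    \cdot \frac{(1 - c_g)\,\psi_g(\theta)}{c_g + (1 - c_g)\,\psi_g(\theta)}
```

With c_g=.00 the last factor equals 1 and the maximum is D^2 a_g^2/4, which is
about .72 for a_g=1.0 and about .52 for a_g=.85, in agreement with Figures 4 and 5.
With c_g>0 the last factor is below 1 everywhere and is smallest at low values of
theta, which would account for both the lower and the asymmetric curve in Figure 6.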

Sequential  Estimation  in  an  Adaptive  Test 


In order to demonstrate the sequential estimation of latent trait status
in an adaptive test, a computer program was used to simulate the test responses
of a person whose latent trait location is θ=+1.0. Twenty items having a_g=1.0
and c_g=.20 were administered. The items' b_g values changed as a function of
responses generated during the simulated test. Table 1 summarizes the results
of this 20-item test.


The first column in Table 1 contains item numbers in the 20-item series
(g=1,2,...,20). The second column contains the b_g values of the items
administered. The difficulty of the first item was b_1=0 because this value
approximates the mean latent trait score in any population of persons that is
sampled to parameterize a set of test items. (An exception to this may be
found in Wright and Panchapakesan's (1969) implementation of the Rasch model.
They scale the latent trait metric such that the mean of the b_g estimates is
zero and the mean θ estimate among persons is, in general, other than zero.)
Following the first item, b_g values either increase or decrease (in accordance
with a procedure to be outlined below) depending on whether a keyed or non-keyed
response was generated. The item response variable u_g is shown in the third
column of Table 1.


Figure 5

Information Curve for a Single Item (a=.85, b=0.0, c=.00)

Figure 6

Information Curve for a Single Item (a=1.0, b=0.0, c=.20)


Table 1

Sequential Estimation of Latent Trait Status
in a 20-Item Adaptive Test

Item                     MAXL     WBL     SBAYES   OBAYES
No.    Diff.   Resp.     Est.     Est.    Est.     Est.

 1      0        1       5.49     1.61     .38      .38
 2     1.00      0        .36     -.85     .05      .04
 3      0        1        .67      .18     .32      .31
 4      .18      1        .89      .82     .53      .54
 5      .82      1       1.16     1.25     .75      .78
 6     1.25      0        .87      .72     .57      .56
 7      .72      1       1.03     1.00     .74      .75
 8     1.00      1       1.20     1.21     .89      .93
 9     1.21      0        .99      .93     .74      .74
10      .93      1       1.12     1.10     .87      .89
11     1.10      0        .95      .89     .73      .72
12      .89      1       1.05     1.02     .84      .84
13     1.02      0        .91      .85     .72      .70
14      .85      1        .99      .96     .82      .80
15      .96      1       1.07     1.05     .90      .90
16     1.05      0        .96      .92     .80      .78
17      .92      1       1.03     1.00     .88      .87
18     1.00      0        .93      .89     .79      .76
19      .89      1        .99      .96     .86      .84
20      .96      1       1.05     1.03     .92      .92


Likelihood-based estimation. The last four columns of Table 1 contain four
different estimates of latent trait status that were calculated after each item
was administered. The fourth column of Table 1 contains maximum-likelihood
estimates of θ. A maximum-likelihood estimate of θ corresponds to the latent
trait location at which the observed pattern of item responses has the maximum
probability of occurrence. The probability of a set of item responses, given some
fixed value of θ and the item parameters, is obtained using the likelihood function
given in Equation 3.

L_v(\theta) = \prod_{g=1}^{k} \left[ P_g(\theta)^{u_g}\, Q_g(\theta)^{1-u_g} \right]    [3]

This equation assumes that the responses of a given person to different test items
are independent of one another. The operator Π indicates that a serial product is
to be taken over the test items administered up to that point (g=1,2,...,k).
After each item was administered, Equation 3 was evaluated at 101 equally
spaced θ values in the interval from θ=-5.00 to θ=+5.00 and the largest of the 101
likelihood values was identified. Then, a quadratic function was fitted to this
largest likelihood value and the two likelihoods adjacent to it. The value of θ
corresponding to the maximum of the quadratic function was used as the "MAXL"
estimate. Under most conditions, the estimate of θ obtained in this manner is
a good approximation to the estimate that would be obtained if more sophisticated
methods of numerical analysis were used to search for a root of the log-likelihood
function's first derivative.
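
The grid search and quadratic refinement just described can be sketched as
follows. This is a minimal illustration, not the program used for Table 1. It
assumes that Equation 1 is the three-parameter logistic model with scaling
constant 1.7, and it assumes that when the largest likelihood falls at an end of
the grid the quadratic is fitted to the three end points. All function names are
hypothetical.

```python
import math

D = 1.7  # assumed scaling constant of the logistic model


def p3pl(theta, a, b, c):
    """Assumed three-parameter logistic response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def likelihood(theta, items, responses):
    """Equation 3: product over items of P**u * Q**(1 - u)."""
    value = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p3pl(theta, a, b, c)
        value *= p if u == 1 else 1.0 - p
    return value


def maxl_estimate(items, responses):
    """Grid search over 101 theta values, refined by a three-point quadratic fit."""
    grid = [-5.0 + 0.1 * i for i in range(101)]
    like = [likelihood(t, items, responses) for t in grid]
    j = like.index(max(like))
    # Assumption: at an end of the grid, fit the quadratic to the three end points.
    j = min(max(j, 1), len(grid) - 2)
    y0, y1, y2 = like[j - 1], like[j], like[j + 1]
    curvature = y0 - 2.0 * y1 + y2
    if curvature == 0.0:
        return grid[j]
    # Vertex of the parabola through three equally spaced points (spacing .1).
    return grid[j] + 0.1 * (y0 - y2) / (2.0 * curvature)


first_item = [(1.0, 0.0, 0.20)]
after_one = maxl_estimate(first_item, [1])
after_two = maxl_estimate(first_item + [(1.0, 1.0, 0.20)], [1, 0])
print(round(after_one, 2), round(after_two, 2))
```

Under these assumptions the two estimates should fall close to the MAXL values
reported for items 1 and 2 in Table 1.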

The interval between θ=-5.00 and θ=+5.00 will contain at least 96% of the θ
estimates in any group that is used to parameterize test items. This is because
latent trait item parameterization procedures scale the theta metric such that the
mean θ estimate equals zero and the standard deviation among the estimates is 1.0
(again, the Rasch model provides an exception to this general result), and because
Tchebycheff's inequality states that the proportion of cases which fall more than
S standard deviations from the mean cannot exceed 1/S² in any distribution
(Hays, 1973, p. 253). With S=5, this proportion is at most 1/25=.04. If the
distribution of θ estimates is peaked and unimodal, virtually all of the θ
estimates will be between -5.00 and +5.00.


Figures 7, 8, and 9 show graphs of the data likelihood function in the
interval from θ=-3.00 to θ=+3.00 following the administration of 1, 2, and 3 items,
respectively. For plotting purposes, the raw likelihood values were expressed
relative to the largest likelihood value in the interval θ=-5.00 to θ=+5.00 and
then rounded to the nearest .02. As can be seen in Equation 3, after one item is
administered the likelihood function corresponds to either P_g(θ) or Q_g(θ), depending
on whether a keyed or non-keyed response is emitted (compare Figure 7 and Figure 3).
The MAXL estimate after a "correct" answer to the first item is +5.49. Actually,
since P_g(θ) is strictly increasing in θ, the estimate should be θ=+∞, but a finite
estimate is certainly more reasonable. After an "incorrect" answer to the second
item, with b_g=1.00, the peak of the likelihood curve occurs near θ=+.36 (Figure 8).
After the third item, the peak occurs near θ=.67 (Figure 9).


Figure 7

Relative Likelihood and Posterior Probability Curves After 1 Item

Figure 8

Relative Likelihood and Posterior Probability Curves After 2 Items

Figure 9

Relative Likelihood and Posterior Probability Curves After 3 Items


"Weighted-by-likelihoods" (WBL) estimates of latent trait status appear in
the fifth column of Table 1. The WBL estimates were obtained by taking a weighted
average of 101 equally spaced θ values in the interval from θ=-5.00 to θ=+5.00.
The weights used were the data likelihoods at each θ value. That is,

\text{WBL Est.} = \frac{\sum_{\theta} [L_v(\theta)\,\theta]}{\sum_{\theta} [L_v(\theta)]}    [4]

where θ takes on the values -5.00, -4.90, ..., +5.00. The WBL estimate is
influenced by the entire set of 101 likelihood values instead of just the maximum
of the likelihood function.
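
The weighted average in Equation 4 can be sketched in a few lines. As before,
this is an illustration under the assumption that Equation 1 is the
three-parameter logistic model with scaling constant 1.7; the function names are
hypothetical.

```python
import math

D = 1.7  # assumed scaling constant of the logistic model


def p3pl(theta, a, b, c):
    """Assumed three-parameter logistic response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def likelihood(theta, items, responses):
    """Equation 3: product over items of P**u * Q**(1 - u)."""
    value = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p3pl(theta, a, b, c)
        value *= p if u == 1 else 1.0 - p
    return value


def wbl_estimate(items, responses):
    """Equation 4: likelihood-weighted average of 101 equally spaced theta values."""
    grid = [-5.0 + 0.1 * i for i in range(101)]
    weights = [likelihood(t, items, responses) for t in grid]
    return sum(w * t for w, t in zip(weights, grid)) / sum(weights)


# First two items of Table 1: (a, b, c) triples and the simulated responses
items = [(1.0, 0.0, 0.20), (1.0, 1.0, 0.20)]
wbl_after_one = wbl_estimate(items[:1], [1])
wbl_after_two = wbl_estimate(items, [1, 0])
print(round(wbl_after_one, 2), round(wbl_after_two, 2))
```

Under the stated assumptions these two values should be close to the WBL
estimates listed for items 1 and 2 in Table 1.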


The MAXL and WBL estimates can differ considerably when only a few items have
been administered, as can be seen in Table 1. Inspection of the relative likelihood
curve in Figure 8 shows why these two estimators differ after two items have
been administered. The WBL estimate is lower due to the fact that the left tail
of the likelihood curve is high relative to the right tail. Table 1 also shows
that the MAXL and WBL estimators become more similar as the number of items
administered increases. Since the WBL estimator has not been proposed previously,
future research is planned to study its characteristics.


The procedure by which item b_g values were determined during the simulated
test now can be outlined. The general rule followed was: Let the next item have a
difficulty level equal to the current value of the WBL estimator, except that in no
case shall the new value be more than 1.00 units from the immediately preceding
b_g value. Thus, as can be seen in Table 1, item difficulties changed by 1.00
until the third item had been administered and the WBL estimate was .18. After
this, each item difficulty corresponded to the value of the WBL estimate following
the preceding item. In actual practice, an item is seldom found with b_g exactly
equal to the current estimate of latent trait status. In such cases, an item that
has b_g close to the desired value is selected for administration.

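
The item selection rule can be written out as a short sketch. The function names
are hypothetical, and `closest_item` merely illustrates the practical case in
which no item has exactly the desired difficulty.

```python
def next_difficulty(wbl_estimate, previous_b, max_step=1.00):
    """Target difficulty for the next item: the current WBL estimate, moved no
    more than max_step units away from the immediately preceding b value."""
    lower = previous_b - max_step
    upper = previous_b + max_step
    return min(max(wbl_estimate, lower), upper)


def closest_item(target_b, pool_b_values):
    """Hypothetical helper: pick the available difficulty nearest the target."""
    return min(pool_b_values, key=lambda b: abs(b - target_b))


# WBL estimates after items 1, 2, and 3 in Table 1
b2 = next_difficulty(1.61, 0.0)   # limited to one unit above b = 0
b3 = next_difficulty(-0.85, b2)   # limited to one unit below b = 1.00
b4 = next_difficulty(0.18, b3)    # within one unit, so the WBL estimate is used
print(b2, b3, b4)
```

The three values correspond to the difficulties of items 2, 3, and 4 in Table 1.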


Bayesian estimation. Columns six and seven of Table 1 contain Bayesian
estimates of latent trait status. Given a specified form for the continuous
distribution of latent trait scores in a population (i.e., the prior probability
density function of theta), the item parameters for the items administered, and a
vector of item responses (u_g values), it is possible, in principle, to derive the
posterior probability density function of theta using the inverse probability rule
of Bayes (Hays, 1973, p. 819). In practice, it becomes difficult to obtain
analytic expressions for the posterior theta distribution unless the prior
distribution and the data likelihood function take on certain restricted forms.
To avoid such difficulties, the following approximate procedure can be used.

First, the continuous prior density function of theta is approximated with a
discrete probability distribution in which the probabilities are concentrated at
101 equally spaced points along the theta continuum. Thus, for example, the area
under the prior density curve between θ=-.05 and θ=+.05 is assigned to the point
θ=.00. This is done for θ=-5.00, -4.90, ..., +5.00. Areas beyond θ=-5.05 and
θ=+5.05 are assigned to the points θ=-5.00 and θ=+5.00, respectively. (These
extreme tail areas should be trivially small. If they are not, the region of
the theta continuum in which the procedure is applied can be shifted or extended.)
Next, data likelihoods are generated at the same 101 values of θ using Equation 3.
The prior probabilities, f(θ), and the data likelihoods, L_v(θ), are then entered
into Equation 5 in order to determine the posterior probability of each given
θ value.

P(\theta \mid v) = \frac{L_v(\theta)\,f(\theta)}{\sum_{\theta} [L_v(\theta)\,f(\theta)]}    [5]


The resulting 101 posterior probabilities provide a discrete approximation to
the continuous posterior distribution of theta. Finally, the mean of the discrete
posterior distribution is obtained with Equation 6 and this value is referred to
as the "SBAYES" (simplified Bayesian) estimate at that stage of the testing
procedure.


\text{SBAYES Est.} = \sum_{\theta} [P(\theta \mid v)\,\theta]    [6]


SBAYES estimates of θ appear in column six of Table 1. Figures 7, 8, and 9 show
three of the posterior probability distributions that were generated with the
SBAYES procedure when the prior distribution of latent trait scores was specified
to be a normal density function with zero mean and unit variance. The first three
SBAYES estimates in Table 1 are the means of these discrete distributions.
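
The SBAYES procedure lends itself to a compact sketch. The code below is an
illustration only: it assumes that Equation 1 is the three-parameter logistic
model with scaling constant 1.7, it uses a normal prior with zero mean and unit
variance as in the example above, and all function names are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant of the logistic model


def p3pl(theta, a, b, c):
    """Assumed three-parameter logistic response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def likelihood(theta, items, responses):
    """Equation 3: product over items of P**u * Q**(1 - u)."""
    value = 1.0
    for (a, b, c), u in zip(items, responses):
        p = p3pl(theta, a, b, c)
        value *= p if u == 1 else 1.0 - p
    return value


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def discrete_normal_prior(grid, half_width=0.05):
    """Area under the N(0,1) density around each grid point; the two end points
    also receive the tail areas beyond the grid."""
    prior = [normal_cdf(t + half_width) - normal_cdf(t - half_width) for t in grid]
    prior[0] += normal_cdf(grid[0] - half_width)
    prior[-1] += 1.0 - normal_cdf(grid[-1] + half_width)
    return prior


def sbayes_estimate(items, responses, prior, grid):
    """Equations 5 and 6: mean of the discrete posterior distribution."""
    joint = [likelihood(t, items, responses) * f for t, f in zip(grid, prior)]
    total = sum(joint)
    return sum(t * j / total for t, j in zip(grid, joint))


grid = [-5.0 + 0.1 * i for i in range(101)]
prior = discrete_normal_prior(grid)
items = [(1.0, 0.0, 0.20), (1.0, 1.0, 0.20)]
est_one = sbayes_estimate(items[:1], [1], prior, grid)
est_two = sbayes_estimate(items, [1, 0], prior, grid)
print(round(est_one, 2), round(est_two, 2))
```

Because the prior is passed in as a list of 101 probabilities, any other prior
distribution could be substituted without changing `sbayes_estimate`.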


The "OBAYES" (Owen Bayesian) latent trait estimates that appear in column
seven of Table 1 were obtained using a procedure described by Owen (1975). While
Owen has described both a method for estimating latent trait status and a method
for selecting test items, only his estimation procedure was used here. Owen
introduced his procedure in the context of a three-parameter normal ogive latent
trait model. The close similarity of this model to the logistic model given in
Equation 1 allows its application here.

The OBAYES procedure has two drawbacks. First, it is limited to prior
distributions that follow a normal density function. The SBAYES procedure described
above can accept any type of prior distribution. Second, the OBAYES procedure is
order dependent. That is, if a set of items is administered and the item responses
are recorded, then the value of the OBAYES estimator will depend partly on the
order in which the items are processed by the scoring procedure. The OBAYES
procedure implicitly generates an updated prior distribution after each item is
scored and then combines this new prior distribution with the likelihood function
for the response to the next item. This in itself would not make the OBAYES
procedure order dependent but, in order to simplify the mathematics, Owen proceeded
as if each updated prior distribution could be described by a normal density
function. This approximation introduces a small amount of inaccuracy into the
estimation process and makes the procedure order dependent. The SBAYES procedure
does not utilize this type of approximation and is not order dependent.

After administering a single item, SBAYES and OBAYES estimates generally agree
to three decimal places when the initial prior distribution of θ is a normal
density function. Since the OBAYES estimate is optimal in this particular
situation, this level of agreement can be viewed as an indication that very little
inaccuracy is introduced by the discrete approximations in the SBAYES procedure.
When more than one item has been administered, or when the prior distribution
specified for the SBAYES procedure is non-normal, the two estimation methods will
not necessarily agree.


Figure  10 

Relative  Likelihood  and  Posterior  Probability  Curves  After  20  Items 


Comparisons between likelihood-based and Bayesian estimates. Figure 10 shows
the relative likelihood and posterior probability curves that resulted after 20
items had been administered. The likelihood curve peaks near θ=1.05 and the
posterior probability distribution has a mean of .92 (see Table 1). Both the
likelihood curve and the posterior probability curve have shifted to the region of
the theta continuum near θ=1.00, and both curves have become more peaked. In fact,
as test length (k) approaches infinity, both of these curves approach a vertical
line (i.e., a single-valued distribution) located at the value of θ that is
generating the item responses.

Note in Table 1 that the Bayesian estimates of θ tend to stay closer to θ=.00
than the likelihood-based estimates throughout the testing process. This is
because Bayesian estimators are "drawn toward" the high density region of the prior
distribution. This is appropriate when one's objective is to minimize squared
errors of estimation in the population specified by the prior distribution.
Unfortunately, for tests of moderate length, a certain amount of bias at the tails
of the theta distribution must be accepted in order to achieve this minimization
(McBride & Weiss, 1976).

For moderate k, the maximum-likelihood estimator can also be biased. However,
for a given value of k and values of θ far from the high density region of a
peaked prior distribution, the maximum-likelihood estimator will tend to be less
biased than the Bayesian estimator. The Bayesian estimator's bias can be reduced
by increasing k as the estimate of θ moves away from the high density region of the
prior distribution. This can be done readily in an adaptive testing situation.

An interesting relationship exists between the likelihood-based estimators
and Bayesian estimators. If one applied the SBAYES estimation procedure and
specified that the prior distribution of theta was rectangular in the interval
θ=-5.05 to θ=+5.05, then the SBAYES estimate of θ, as determined by Equation
6, would be identical to the WBL estimator. Moreover, the MAXL estimate would
closely approximate the mode of the Bayesian posterior probability distribution.
Thus, all four types of latent trait estimators that have been presented here
can be viewed as Bayesian estimators. The MAXL estimator is a Bayesian modal
estimate of θ when the implicit prior is restricted to a rectangular form, the
WBL estimator is a least-squares estimate of θ when the implicit prior is
restricted to a rectangular form, and the OBAYES estimator is a least-squares
estimate of θ when the explicit prior is restricted to a normal form. The
SBAYES procedure is the only one of the four methods that does not restrict
the form of the prior distribution. By virtue of this flexibility, the SBAYES
estimation procedure appears to be the most widely applicable of the four
methods.
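
The equivalence of the SBAYES and WBL estimators under a rectangular prior can
be made explicit with a short worked step. The notation follows Equations 4, 5,
and 6; the value 1/101 is simply the probability that a rectangular prior on the
interval from -5.05 to +5.05 assigns to each of the 101 grid points.

```latex
% Rectangular prior over the 101 grid points: f(\theta) = 1/101 for every \theta.
P(\theta \mid v)
  = \frac{L_v(\theta)\,\tfrac{1}{101}}{\sum_{\theta} L_v(\theta)\,\tfrac{1}{101}}
  = \frac{L_v(\theta)}{\sum_{\theta} L_v(\theta)}

% Substituting into Equation 6 gives Equation 4:
\text{SBAYES Est.}
  = \sum_{\theta} P(\theta \mid v)\,\theta
  = \frac{\sum_{\theta} L_v(\theta)\,\theta}{\sum_{\theta} L_v(\theta)}
  = \text{WBL Est.}
```

Since the posterior probabilities are then proportional to the likelihoods, the
posterior mode falls at the grid point with the largest likelihood, which is the
starting point of the MAXL procedure.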

Total  Test  Information 


Birnbaum  (1968,  p.  454)  has  defined  the  information  function  of  a test  as 

I(\theta) = \sum_{g=1}^{k} I(\theta, u_g)    [7]

This  function  is  the  sum  of  the  constituent  item  information  functions  and 
defines  the  maximum  amount  of  information  that  can  be  extracted  from  a set 
of  items.  The  amount  of  information  actually  extracted  depends  on  how  the 
items  are  scored. 

Information in the adaptive test. Figure 11 shows a graph of the test
information function for the 20 items administered in the simulated adaptive
test. It was obtained by evaluating Equation 7 at 61 equally spaced points
along the theta continuum in the interval from θ=-3.00 to θ=+3.00. This curve
shows the maximum amount of information available from these items. The curve
peaks near θ=1.00, thus indicating that this set of items provides maximum
discrimination among individuals whose latent trait locations fall near
θ=1.00. The maximum value of the curve is about 9.00.

Figure 11

Information Curve for 20-Item Adaptive Test
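
A curve of the kind shown in Figure 11 can be sketched by summing item
information over the 20 difficulties listed in Table 1. The code is an
illustration that assumes Equation 1 is the three-parameter logistic model with
scaling constant 1.7; under that assumption Equation 2 has the closed form used
in `item_information`. The function names are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant of the logistic model


def item_information(theta, a, b, c):
    """Equation 2 in closed form for the assumed logistic model:
    [P']**2 / (P * Q) = (D*a)**2 * (1 - c) * psi**2 * (1 - psi) / P."""
    psi = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * psi
    return (D * a) ** 2 * (1.0 - c) * psi ** 2 * (1.0 - psi) / p


def total_information(theta, items):
    """Equation 7: sum of the item information functions."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


# Difficulties of the 20 items administered (second column of Table 1)
adaptive_b = [0.00, 1.00, 0.00, 0.18, 0.82, 1.25, 0.72, 1.00, 1.21, 0.93,
              1.10, 0.89, 1.02, 0.85, 0.96, 1.05, 0.92, 1.00, 0.89, 0.96]
adaptive_items = [(1.0, b, 0.20) for b in adaptive_b]

# 61 equally spaced theta values from -3.00 to +3.00
thetas = [round(-3.0 + 0.1 * i, 2) for i in range(61)]
curve = [total_information(t, adaptive_items) for t in thetas]
peak_value = max(curve)
peak_theta = thetas[curve.index(peak_value)]
print(peak_theta, round(peak_value, 2))
```

Under the stated assumptions the peak should have a height of roughly 9 and lie
in the neighborhood of theta = 1, as described for Figure 11.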


Information in two conventional tests. Figure 12 shows a graph of the test
information function for a set of 20 items having a_g=1.0, c_g=.20, and b_g values
equally spaced in the interval from -3.00 to +3.00 (i.e., b_g=-3.00, -2.68, -2.37,
..., +3.00). This would commonly be referred to as a "rectangular" distribution
of item difficulties. This test provides a fairly uniform level of information
across a broad range of the theta continuum. Unfortunately, the level of
information is relatively low. The curve attains its maximum value of about 3.20
in the interval -1.00 ≤ θ ≤ 1.90.


Figure 13 shows a graph of the test information function for a set of 20
items having a_g=1.0, c_g=.20, and b_g=0.0 for all items. This is a "perfectly
peaked" test. The shape of this information curve is rather similar to the
curve in Figure 11, but it is shifted to the left. The curve in Figure 13
attains its maximum value of 9.80 near θ=.16. At θ=1.00, the value of this
information curve is about 5.80.


Figures 12 and 13 represent two rather idealized non-adaptive tests. Both
of these tests deliver less information at θ=1.00 than the items selected by the
adaptive testing procedure. What is the implication of this result? If, for
some practical purpose, it were necessary to order a testee with θ=1.00 relative
to other individuals falling at nearby θ values, fewer errors would be made if
θ estimates derived from the adaptive test's items were used than if estimates
derived from either conventional test were used.
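
The comparison at theta = 1.00 can be sketched for the two conventional tests as
follows. The code is an illustration that assumes Equation 1 is the
three-parameter logistic model with scaling constant 1.7; the function and
variable names are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant of the logistic model


def item_information(theta, a, b, c):
    """Item information in closed form for the assumed logistic model."""
    psi = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    p = c + (1.0 - c) * psi
    return (D * a) ** 2 * (1.0 - c) * psi ** 2 * (1.0 - psi) / p


def total_information(theta, items):
    """Sum of the item information functions of a test."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)


# Rectangular test: 20 difficulties equally spaced from -3.00 to +3.00
rectangular = [(1.0, -3.0 + 6.0 * i / 19, 0.20) for i in range(20)]
# Perfectly peaked test: all 20 difficulties at 0.0
peaked = [(1.0, 0.0, 0.20)] * 20

info_rectangular = total_information(1.0, rectangular)
info_peaked = total_information(1.0, peaked)
print(round(info_rectangular, 2), round(info_peaked, 2))
```

Both values should fall well below the figure of about 9.00 reported for the
adaptive test's items near theta = 1.00.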


Figure 12

Information Curve for 20-Item Rectangular Test (-3.0 < b < +3.0)

Figure 13

Information Curve for 20-Item Peaked Test (b=0.0 for all items)


Summary


Several  procedures  for  estimating  latent  trait  status  have  been  presented. 

It  has  also  been  suggested  that  adaptive  testing  procedures  often  can  provide 
more  accurate  estimates  of  latent  trait  status  than  conventional  tests.  Though 
there  is  no  necessary  connection  between  latent  trait  theory  and  adaptive  testing, 
there  is  a strong  natural  impetus  toward  their  joint  application.  Latent  trait 
theory  provides  adaptive  testing  with  a coherent  theoretical  foundation.  It  is  a 
guide  to  procedures  for  designing  and  scoring  adaptive  tests.  On  the  other 
hand,  adaptive  testing  offers  the  opportunity  to  take  maximum  advantage  of  the 
potentialities  of  latent  trait  theory.  At  this  point  in  time,  both  a new  type 
of  test  theory  and  a new  type  of  testing  technology  are  available.  Their  joint 
effect  might  possibly  exceed  the  sum  of  the  two  parts. 


ADAPTIVE  TESTING  AND  THE  PROBLEM  OF  CLASSIFICATION 


C. DAVID VALE

University  of  Minnesota 


Two basic goals in the use of ability tests are measurement and classification.
When a test is used for measurement, the objective is to accurately determine where a
testee's ability lies on the latent ability continuum. When a test is used for
classification, the objective is to determine on which side of a cutting score (or
between which cutting scores) a testee's ability lies. Such classification decisions
should be made so as to minimize the errors of misclassification. Once a
classification is made, there is no necessity for a more precise determination of an
individual's ability level.

This paper is concerned with the classification of abilities into discrete
categories. The general goals of classification will be explicated and alternative
means that may practically be used to achieve these goals will be presented and
compared using Monte Carlo computer simulations.

The  Classification  Problem 
Classification  Errors  and  Utility  Functions 

The goal of this classification is to determine, with a minimal probability of
being in error, on which side of a cutting score or between which of several cutting
scores, a testee's ability falls. There are two kinds of error probabilities that
can be examined in making these classifications. One is the conditional probability
of being in error (i.e., for a single testee or at a specific ability level); the
other is the expected or unconditional probability of being in error across a group of
testees. The conditional probability is a function of the test, the testee's ability
level and the placement of the cutting score (for the moment, limiting the discussion
to one cutting score). For a given test of fixed length, the probability of making an
error of classification for a testee is usually high if the testee's ability level (θ)
is near a cutting score (θ_t), and lower if the ability level is distant from the
cutting score. This conditional probability of misclassification [P(M|θ)] is described
by a function like that shown in Figure 14.

The unconditional probability of misclassification for a group of testees
[P(M)], is a function of the conditional reliability function and the distribution
of abilities within the group under consideration. For a large group with
abilities distributed N(0,1), this probability is given by Equation 8.

P(M) = \int_{-\infty}^{\infty} P(M \mid \theta)\,\phi(\theta)\,d\theta    [8]

where \phi(\theta) = [2\pi \exp(\theta^2)]^{-1/2}

In practical situations, it may be desirable to minimize the quantity in
Equation 8. This unconditional probability is a scalar quantity and as such can be
minimized. A function such as the conditional probability function can only be
minimized at a single point and this is typically of little practical value
because theoretically, assuming a continuous distribution of ability, the
probability of anyone having an ability at that point is zero.
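
The integral in Equation 8 can be approximated numerically once a conditional
misclassification function is available. The sketch below is purely
illustrative: the form chosen for P(M|theta) is a hypothetical stand-in (it
assumes that the ability estimate is normally distributed about theta with a
constant standard error `se`), and it is not the function used in this paper.

```python
import math


def normal_pdf(x):
    """Standard normal density, the phi of Equation 8."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def p_misclass_given_theta(theta, cut, se):
    """Hypothetical P(M | theta): the ability estimate is assumed normal about
    theta with standard error se; an error occurs when the estimate falls on
    the other side of the cutting score."""
    return normal_cdf(-abs(theta - cut) / se)


def p_misclass(cut, se, lo=-6.0, hi=6.0, n=2400):
    """Equation 8, approximated with the trapezoidal rule over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        theta = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * p_misclass_given_theta(theta, cut, se) * normal_pdf(theta)
    return total * h


precise = p_misclass(cut=0.0, se=0.3)  # more precise ability estimates
coarse = p_misclass(cut=0.0, se=0.6)   # less precise ability estimates
print(round(precise, 3), round(coarse, 3))
```

Under this hypothetical model the unconditional probability of misclassification
is a single number for each test, so tests can be ranked by it directly.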


Figure  14 

A Conditional Probability of Misclassification Curve

A more viable approach to making classification decisions is one that will,
over a group of individuals, maximize some form of utility such as the quality of
performance extracted from the work force. The unconditional probability of
misclassification reflects errors of classification into categories along a latent
continuum and it may be errors of classification along an observable success-failure
continuum that are of interest. This possibility is important because two
individuals, one with an ability level slightly above a cutting score on the latent
continuum and the other with ability slightly below the cutting point, probably
have a trivial difference between their probabilities of success on a job. If
both are classified above the cutting score, however, one will be considered a
"hit" and the other a "miss" when classification occurs on the latent continuum.

In order to assess the practical value (i.e., cost effectiveness) to an
organization of an adaptive testing strategy, utility functions of θ for each decision
must be specified. As an example of such utility functions, consider the following:
For three classifications (low, middle, and high), three utility functions
might be:

U_{low} = .5    [9]

U_{medium} = \Phi(3.0(\theta + 0.7))    [10]

U_{high} = 2.0\,\Phi(3.0(\theta - 0.7))    [11]

where \Phi(x) = \int_{-\infty}^{x} \phi(t)\,dt


A practical situation in which these utility functions might arise is as
follows: There are three jobs requiring an ability, θ. One is so easy that almost
anyone can do it but, when performed satisfactorily, it is worth only .5 utility
units to the organization. A second job is fairly easy, and 50% of people with θ
at -.7 can perform it satisfactorily. Differences in ability near -.7 make
greater changes in the probability of success than do differences around, say,
θ=0.0. Ninety-eight percent of people with θ at 0.0 will be successful on the
job, and additional increments in θ are of little importance in predicting job
success. Success in this job is worth one unit of value. A third job requires
higher θ to be successful, but is worth two units of value when performed
satisfactorily. The utility functions defined by Equations 9, 10, and 11 result in the
three utility curves presented in Figure 15. As can be seen, there is a clear
reason for assigning high θ people to the third job and lower θ people to the
second and first jobs.

Figure 15

Conditional Utilities for each of Three Decisions
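The three utility functions can be evaluated numerically. The following is a minimal sketch, assuming that Φ in Equations 10 and 11 is the standard normal distribution function (an assumption consistent with the 50% and 98% success rates quoted above); the function and variable names are illustrative and do not come from the original paper.

```python
from statistics import NormalDist

# Assumption: Phi is the standard normal cumulative distribution function.
PHI = NormalDist().cdf


def u_low(theta):
    """Equation 9: the easiest job is worth .5 units at every ability level."""
    return 0.5


def u_medium(theta):
    """Equation 10: success probability centered at theta = -0.7."""
    return PHI(3.0 * (theta + 0.7))


def u_high(theta):
    """Equation 11: success centered at theta = +0.7, worth two units."""
    return 2.0 * PHI(3.0 * (theta - 0.7))


UTILITIES = {"low": u_low, "medium": u_medium, "high": u_high}


def best_decision(theta):
    """Return the decision whose utility is largest at this ability level."""
    return max(UTILITIES, key=lambda name: UTILITIES[name](theta))


for theta in (-2.0, 0.0, 2.0):
    print(theta, best_decision(theta))
```

Under these assumptions the medium-job utility is exactly .5 at θ=-.7 and about .98 at θ=0.0, which matches the verbal description of the second job.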

Test  Design  for  Classification  Problems 

Although it may be possible to determine that quantity (e.g., probability
of misclassification or expected utility) which is to be minimized or maximized,
it is difficult to design a test explicitly for that purpose. The goal of optimal
test design can be approached practically via one of several approximation
strategies. Two general types of testing strategies that have been researched in the
ability measurement domain are the conventional testing strategy and the adaptive
testing strategy. In the former, test items are selected to best measure the
abilities of members of a group, and the same test is given to everyone. In the
latter, a test is tailored, during the testing process, to each individual's level of
ability, and a different test may be given to each person. This permits higher
measurement precision over most of the ability continuum than that attained with
a conventional test.

In  the  remainder  of  this  paper,  two  forms  of  a conventional  test  and  one  form 
of  an  adaptive  test  will  be  compared.  The  conventional  tests  will  be  a unimodally 
peaked  test  with  all  item  difficulties  of  one  value  and  a bimodally  peaked  test 
(i.e.,  the  simplest  form  of  a multimodally  peaked  test)  with  difficulties  of  two 
values.  As  will  be  discussed  later,  these  are,  respectively,  attempts  to  put 
items  at  a level  where  they  best  measure  most  people  or  at  a level  where  people 
need  to  be  measured  best.  The  adaptive  test  to  be  compared  will  be  Owen's  (1975) 
Bayesian  strategy.  This  strategy  starts  with  some  estimate  of  an  individual's 
ability,  chooses  an  appropriate  item,  administers  the  item,  and  forms  a new 
estimate  of  the  individual's  ability.  Using  this  estimate,  it  chooses  the  next 
item  and  continues  this  procedure  until  the  end  of  the  test. 
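The start, select, administer, re-estimate cycle described above can be sketched in code. This is not Owen's (1975) procedure itself, whose closed-form updating equations are not reproduced in this paper; it swaps in a simple grid-based Bayesian posterior mean. It assumes a normal ogive item model with no guessing, a N(0,1) prior, and an unlimited pool so that an item with difficulty equal to the current estimate is always available. All names are hypothetical.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def p_correct(theta, a, b):
    """Normal ogive item characteristic curve with no guessing (assumed model)."""
    return STD_NORMAL.cdf(a * (theta - b))


def adaptive_test(true_theta, n_items=10, a=1.0, rng=None):
    """Simulate one adaptive test and return the final ability estimate."""
    rng = rng or random.Random(0)
    grid = [-4.0 + 0.05 * i for i in range(161)]      # ability grid on [-4, 4]
    posterior = [STD_NORMAL.pdf(t) for t in grid]      # N(0,1) prior weights
    estimate = 0.0                                     # prior mean
    for _ in range(n_items):
        b = estimate                                   # item at current estimate
        correct = rng.random() < p_correct(true_theta, a, b)
        # Multiply the prior by the likelihood of the observed response.
        for i, t in enumerate(grid):
            p = p_correct(t, a, b)
            posterior[i] *= p if correct else 1.0 - p
        total = sum(posterior)
        posterior = [w / total for w in posterior]
        # The posterior mean becomes the new ability estimate.
        estimate = sum(t * w for t, w in zip(grid, posterior))
    return estimate


print(adaptive_test(1.0))
```

A grid posterior is slower than a closed-form update but makes each step of the cycle explicit.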

These  strategies  will  be  compared  along  the  criteria  previously  discussed. 

Since utility functions are peculiar to an organization, the majority of the
comparisons will be in terms of misclassification probabilities. The utility
functions presented above will, however, be discussed as examples in some later
comparisons.

Simulation Procedures


The  comparisons  presented  in  this  paper  assume  that  classification  decisions 
are  made  in  the  following  way: 

1)  A testing  strategy  selects  a subset  of  items  from  a large  pool  of  items; 

2)  These  items  are  then  administered  to  a testee,  and  from  his  responses 
to  those  items  an  estimate  of  ability  level  is  obtained; 
3) The testee is then classified into that category which:

a) in the case where probability of misclassification is of interest,
is the one in which his estimated ability falls, or

b)  in  the  case  where  utility  maximization  is  of  interest,  is  the  one 
which  for  his  estimated  ability  predicts  the  highest  utility. 

To  simplify  the  analyses  and  interpretations,  availability  of  an  infinitely 
large  item  pool  was  assumed.  This  pool  contained  items  of  all  difficulties  with 
their  discriminating  powers  fixed  at  a constant  level.  It  was  further  assumed  that 
these  items  could  not  be  correctly  answered  by  guessing.  These  assumptions  reduced 
the  problem  of  item  selection  to  determining  the  difficulty  of  the  next  item  to  be 
administered in the adaptive test. Finally, to make a determination of the
unconditional probability of misclassification possible, ability was assumed
distributed N(0,1).

Owen's (1975) Bayesian testing procedure requires a prior estimate of a
testee's ability to administer and score a test. For all data presented in this
paper, a fixed prior ability distribution which was N(0,1) was used for all testees.
Owen's scoring procedure was used to score the conventional tests and again a N(0,1)
prior was used.

Generation of Misclassification Probabilities and Expected Utilities

Conditional probability of misclassification was calculated for each of 30
values of θ equally spaced between θ=-1.45 and θ=1.45. The simulation procedure
followed that described by McBride and Weiss (1976) or Vale and Weiss (1975).
Ten-item "tests" were administered to 200 "testees" at each of 30 points. The means
and standard deviations of the ability estimates were calculated at each point, a
normal distribution with these parameters was determined, and the proportion of
that distribution falling outside the correct cutting score interval was taken as
the probability of misclassification at that level of ability. These probabilities
were then visually fitted into the smooth curves shown in the figures.
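The final step of this procedure (fit a normal distribution to the estimates at one ability level, then take the proportion falling outside the correct interval) can be sketched as follows. The sample standard deviation is used here as an assumption, since the paper does not say which form was computed, and the function name and example values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def conditional_misclassification(estimates, lower, upper):
    """Proportion of a normal distribution, fitted to the ability estimates
    obtained at one true ability level, that falls outside (lower, upper),
    the cutting score interval containing that true ability."""
    fitted = NormalDist(mean(estimates), stdev(estimates))
    inside = fitted.cdf(upper) - fitted.cdf(lower)
    return 1.0 - inside


# Hypothetical estimates for testees whose true ability lies below a single
# cutting score at 0.0, so the correct interval is (-infinity, 0.0).
example = [-0.6, -0.1, -0.3, 0.2, -0.4, -0.2]
print(conditional_misclassification(example, float("-inf"), 0.0))
```

An open-ended interval is expressed with infinite bounds, so the same function serves both the extreme and the middle categories.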

To determine the unconditional probability of misclassification, ten-item
"tests" were administered to 2,000 "testees" with ability levels randomly sampled
from a N(0,1) population of ability levels (the same sample of 2,000 ability levels
was used for all comparisons). The predicted category for individuals was the
score interval in which their ability estimate fell. The true category was the
interval in which their true ability fell. An individual was considered
misclassified if the predicted category was not the same as the true category. The number
of misclassified individuals divided by 2,000 was taken as the unconditional
probability of misclassification.
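The counting rule just described can be sketched as a small Monte Carlo routine. The estimator passed in below is a stand-in that simply adds normal noise to the true ability; it is not Owen's scoring procedure, and its error standard deviation of 0.4 is an arbitrary illustrative value. All names are hypothetical.

```python
import random
from bisect import bisect_right


def category(theta, cut_scores):
    """Index of the score interval containing theta (0 is the lowest)."""
    return bisect_right(cut_scores, theta)


def unconditional_misclassification(estimator, cut_scores, n_testees=2000, seed=0):
    """Proportion of simulated testees whose predicted and true categories differ."""
    rng = random.Random(seed)
    cuts = sorted(cut_scores)
    errors = 0
    for _ in range(n_testees):
        true_theta = rng.gauss(0.0, 1.0)           # ability sampled from N(0,1)
        theta_hat = estimator(true_theta, rng)     # ability estimate from a "test"
        if category(theta_hat, cuts) != category(true_theta, cuts):
            errors += 1
    return errors / n_testees


def noisy_estimator(true_theta, rng, sem=0.4):
    """Stand-in estimator (assumption): true ability plus normal error."""
    return true_theta + rng.gauss(0.0, sem)


print(unconditional_misclassification(noisy_estimator, [-0.7, 0.7]))
```

Fixing the seed reuses the same sample of ability levels across strategies, as the paper does for its comparisons.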

Expected utility was determined by generating 2,000 ability estimates following
the same procedures used in the calculation of expected probability of
misclassification. The optimal decision to make for an individual was taken as the decision
corresponding to the utility function with the highest value at the estimated level
of ability. The actual utility was the value of the utility function corresponding
to the decision made, evaluated at the "testee's" true level of ability. The
expected utility was simply the mean of these 2,000 actual utility values. These
values are reported only in comparisons of tests in decisions involving more than
one cutting score.
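The distinction between the decision (made at the estimated ability) and the payoff (evaluated at the true ability) is the key point of this procedure. A minimal sketch follows, again assuming Φ is the standard normal distribution function; the noisy estimates are a hypothetical stand-in for scored tests, and the names are illustrative.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # assumption: standard normal distribution function

# Utility functions of Equations 9 to 11, indexed 0 (low), 1 (medium), 2 (high).
UTILITIES = (
    lambda theta: 0.5,
    lambda theta: PHI(3.0 * (theta + 0.7)),
    lambda theta: 2.0 * PHI(3.0 * (theta - 0.7)),
)


def actual_utility(true_theta, theta_hat):
    """Choose the decision at the estimate, then evaluate it at the truth."""
    decision = max(range(len(UTILITIES)), key=lambda k: UTILITIES[k](theta_hat))
    return UTILITIES[decision](true_theta)


def expected_utility(pairs):
    """Mean actual utility over (true ability, estimated ability) pairs."""
    return sum(actual_utility(t, e) for t, e in pairs) / len(pairs)


rng = random.Random(0)
truths = [rng.gauss(0.0, 1.0) for _ in range(2000)]
pairs = [(t, t + rng.gauss(0.0, 0.4)) for t in truths]  # stand-in estimates
print(expected_utility(pairs))
```

A testee wrongly assigned to the high job contributes almost nothing, whereas a near-miss around a cutting score costs very little; this is why utility and misclassification counts can rank strategies differently.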

Results 


A Single  Cutting  Score 

The simplest categorization situation to investigate is where there is one
cutting score placed in the middle of the ability distribution at θ_c=0.0. The best
conventional test for making this decision is one with all of its items peaked at
b=0.0. Figure 16 shows curves representing standard error of measurement functions
(the reciprocal square root of the information functions) for three ten-item tests
with a=2.0: a peaked conventional test with all items having b=0.0, an ideal
adaptive test with all items having b=θ, and a practical adaptive test with items
having difficulties at the estimated ability level at each stage. The conventional
test provides a low error level at θ=0.0, but higher error levels distant from that
point. The ideal adaptive test provides the same low level of error at all ability
levels but is unrealistic because, in order to implement it, it is necessary to know
a testee's ability level before the test is administered. A practical adaptive test
provides a standard error function lower than that of the conventional test at
ability levels distant from θ=0.0, but relatively higher near θ=0.0.

Figure 16

Standard Error of Measurement Curves for Three Tests
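The contrast between the peaked conventional test and the ideal adaptive test can be reproduced numerically. The sketch below assumes a normal ogive item model with no guessing, for which the item information is a²φ(z)² / [Φ(z)(1 - Φ(z))] with z = a(θ - b); the paper does not print this formula, so it should be read as an assumption. Function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def item_information(theta, a, b):
    """Information of one normal ogive item without guessing (assumed model).
    Numerically safe only for moderate z; extreme z makes p*(1-p) underflow."""
    z = a * (theta - b)
    p = STD.cdf(z)
    return (a * STD.pdf(z)) ** 2 / (p * (1.0 - p))


def sem(theta, a, difficulties):
    """Standard error: reciprocal square root of the summed item information."""
    info = sum(item_information(theta, a, b) for b in difficulties)
    return 1.0 / math.sqrt(info)


def peaked_sem(theta, a=2.0, n=10):
    """Conventional test with every item peaked at b = 0.0."""
    return sem(theta, a, [0.0] * n)


def ideal_adaptive_sem(theta, a=2.0, n=10):
    """Ideal adaptive test with every item at b = theta."""
    return sem(theta, a, [theta] * n)


for theta in (0.0, 0.5, 1.0, 1.5):
    print(theta, round(peaked_sem(theta), 3), round(ideal_adaptive_sem(theta), 3))
```

The two functions coincide at θ=0.0 and diverge as θ moves away from the peak, which is the pattern described for Figure 16.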

Assuming errors of measurement at a level of θ are distributed N(0, SEM²), the
probability of misclassifying an individual is given by Equation 12.

    P(M|\theta) = 1 - \Phi\left[ \frac{|\theta_c - \theta|}{SEM} \right]
                = 1 - \Phi\left[ \sqrt{I(\theta)\,(\theta_c - \theta)^2} \right]      [12]

where θ_c is the cutting score, and I(θ) is the test information
function evaluated at θ.

It can be shown from Equation 12 that when θ is fixed, P(M|θ) is a monotonic
increasing function of the standard error of measurement. Thus, the ordering of the
three testing strategies on P(M|θ) is the same as their ordering on conditional
standard errors of measurement at any level of θ. It can then be seen from these
curves that a practical adaptive test can provide a lower expected probability of
misclassification if it approximates the ideal adaptive test. How well a given
adaptive testing strategy approximates the ideal is, of course, an empirical
question.

Figure 17

Conditional Probability of Misclassification, a=1.0
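Equation 12, read as P(M|θ) = 1 - Φ(|θ_c - θ| / SEM), is straightforward to evaluate. The sketch below assumes Φ is the standard normal distribution function and shows both the SEM form and the equivalent information form; the function names are hypothetical.

```python
import math
from statistics import NormalDist

PHI = NormalDist().cdf  # assumption: standard normal distribution function


def p_misclassify(theta, cut, sem):
    """Equation 12 in its standard-error form."""
    return 1.0 - PHI(abs(cut - theta) / sem)


def p_misclassify_from_information(theta, cut, information):
    """Equation 12 in its information form, using SEM = 1 / sqrt(I)."""
    return 1.0 - PHI(math.sqrt(information * (cut - theta) ** 2))


# At a fixed ability level, a larger standard error gives a larger
# probability of misclassification.
for sem in (0.2, 0.3, 0.4):
    print(sem, round(p_misclassify(0.3, 0.0, sem), 4))
```

At the cutting score itself the probability is .5 regardless of the standard error, which is why all of the conditional curves peak there.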

Figure 17 presents the P(M|θ) curves for a ten-item conventional test, with
difficulties peaked at b=0.0, and a ten-item Bayesian adaptive test, both with item
discrimination fixed at a=1.0 and both scored by Owen's method. The curves appear
very similar, being high near the cutting point (indicating a high probability of
making an error) and low distant from the cutting point. The conventional test
allows somewhat better decisions for values of θ nearer to the cutting score. The
differences in the conditional probability of misclassification function yield a
very small difference between unconditional probability of misclassification values
for the two strategies, which were .120 for the conventional test and .122 for the
Bayesian test. (Unconditional probabilities are shown in parentheses beside the
legend in Figure 17 and successive figures.)


Figure 18

Conditional Probability of Misclassification, a=2.0

Figure 18 shows P(M|θ) curves for the same strategies with item discriminations
of a=2.0. The same general results were obtained, except that the differences
at values of θ distant from the cutting score were more pronounced, and the range
of superiority of the conventional test was smaller. Due to the N(0,1) shape of
the ability distribution, however, small differences near the cutting point are as
important in the determination of the expected probability of misclassification as
large differences distant from the cutting point. Difference in expected
probability was still very low (.076 versus .075).


Figure 19

Conditional Probability of Misclassification, a=3.0

Figure 19 shows curves for tests with high item discrimination (a=3.0). Again
similar results were obtained and the difference in expected probability of
misclassification was still small (.052 versus .054).


These results suggest that an adaptive test makes classification decisions
about as well as a conventional test in this simple case where a conventional test
should perform better in comparison to an adaptive test. However, it should be
noted that the conventional test was superior to the adaptive test in an
increasingly narrower range of θ with increasing item discriminations.

More  than  One  Cutting  Score 

Design of conventional tests is more complicated, however, when the cutting
scores deviate from the center of the ability distribution. A given increase in
information, which corresponds to a given decrease in standard error, has its
greatest effect on the conditional probability of misclassification at ability
levels near a cutting score. This suggests that items should be peaked at the
cutting scores. But a given reduction in conditional probability of
misclassification has its greatest effect on the expected probability of misclassification at
levels of ability where most of the people are located. This, assuming θ~N(0,1),
suggests peaking the item difficulties at b=0.0. As a result, when the cutting
score is at some value of θ other than 0.0, the two suggestions are in conflict.
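The conflict can be stated compactly. Under the assumption already made that θ is distributed N(0,1), and writing φ for the standard normal density, the expected (unconditional) probability of misclassification is the density-weighted average of the conditional probabilities. This integral is a standard identity rather than an equation printed in the original paper.

```latex
P(M) = \int_{-\infty}^{\infty} P(M \mid \theta)\,\phi(\theta)\,d\theta
```

A reduction in P(M|θ) therefore counts in proportion to φ(θ), which is largest at θ=0.0, while added information reduces P(M|θ) most near a cutting score.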

The  optimal  point(s)  to  peak  the  difficulties  will  be  some  function  of  the  location 
of  the  cutting  scores,  the  discriminating  powers  of  the  items,  and  the  underlying 
ability  distribution.  Determination  of  such  an  optimal  design  of  a conventional 
test  is  beyond  the  scope  of  this  paper.  However,  comparisons  of  some  standard 
conventional  test  designs  with  an  adaptive  test  will  be  informative. 


Figure 20

Conditional Probability of Misclassification, a=1.0
Assume that there are two cutting scores, one at θ=-.7 and the other at θ=.7,
and that all errors of misclassification are equivalent in terms of importance.
One classical approach to designing a conventional test involves peaking half of
the items at each of the two cutting scores, where the fine distinctions need to be
made; such a test can be referred to as a bimodal conventional test. Another
approach is to peak all the items at b=0.0; this test can be called a unimodal
conventional test.


Figures 20 through 22 present the conditional probabilities of
misclassification for each of the unimodal and bimodal conventional tests, and the Bayesian
adaptive test, at three levels of item discrimination. Figure 20 shows the curves
for the case when a=1.0. There is little suggestion in Figure 20 as to which
strategy is better. But an interesting discontinuity is observed for estimates
from all testing strategies at the cut points. This characteristic is due to the
fact that, for finite-length tests (which include 10-item tests like those used
here), Owen's Bayesian score is biased (i.e., the expected value of the score
at a given level of θ is not θ). Specifically, in this case, the Bayesian score is
biased in the vicinity of the cutting scores toward the center of the population
ability distribution at θ=0.0. This causes more testees to be classified into the
middle interval than would be by an unbiased score. The effect is that fewer errors
of classification are made for ability levels in the middle interval and more are
made for individuals in the two extreme intervals. Comparing expected probabilities
of misclassification, the adaptive test yields the lowest probability (.197) and
the bimodal conventional, the highest (.224).


Figure 21

Conditional Probability of Misclassification, a=2.0
It is difficult to say in this case, however, whether the adaptive test
provides a lower expected probability of misclassification because it makes better
decisions or because it is conservative. The conservatism results in more
classification errors in the extreme categories, and fewer errors at central ability
levels where more individuals' ability levels lie.

When a=2.0 (Figure 21), the unimodal conventional test shows pronounced
discontinuity suggesting that scores are too extreme near the cutting points. The
adaptive test provides the smallest conditional probabilities of misclassification
over most of the ability range. It makes a few more errors in the extreme intervals
than does the unimodal conventional test, but the unimodal test's superiority is
offset by extreme error rates in the middle interval. In terms of expected
probabilities of misclassification, the adaptive test is again superior [P(M)=.110].
With an expected probability of misclassification of .126, the bimodal conventional
test, its nearest competitor, is expected to make 1.15 times as many errors of
classification.


Figure 22

Conditional Probability of Misclassification, a=3.0
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Unimodal  (.16b 
Bimodal  (.085) 
Adaptive  (.072) 


When  a=3.0,  as  shown  in  Figure  22,  the  same  general  results  were  obtained. 
The  expected  probability  of  misclassification  for  the  bimodal  conventional  test 
(.085)  was  1.18  times  as  large  as  that  of  the  Bayesian  adaptive  test  (.072).  It 
should  be  noted,  however,  that  items  this  discriminating  are  rare  in  practice. 
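The ratios quoted for a=2.0 and a=3.0 follow directly from the reported expected probabilities (bimodal conventional over Bayesian adaptive). The same figures can be expressed as the proportional reduction in errors obtained by switching to the adaptive test; this second form is a restatement added here for illustration, not a computation printed in the original.

```latex
\frac{.126}{.110} \approx 1.15, \qquad \frac{.085}{.072} \approx 1.18

1 - \frac{.110}{.126} \approx .13, \qquad 1 - \frac{.072}{.085} \approx .15
```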


Utility  Comparisons 


It is tempting to take these values at this point and say that adaptive
testing can greatly reduce overall errors of classification by up to 15 percent
in a realistic classification situation. But, as was discussed earlier, the
errors of classification presented thus far are based on a latent ability
continuum rather than an observable success-failure continuum. Using the utility
functions presented earlier and choosing the decision yielding the highest expected
utility for the estimate of ability, average utilities for the bimodal conventional
test (the best conventional test in previous comparisons) and the Bayesian test
were .808 and .820, respectively, using the items of a=1.0. For the same sample
of abilities and a=2.0, the utilities were .831 and .849. With a=3.0, the values
were .855 and .858. Whether these differences are practically significant depends
on what these units of utility mean in a particular context. But such utilities
(of which these are only an example) must ultimately be considered in determining
the comparative values of conventional versus adaptive testing for classification
decisions.


Conclusions 


These results suggest that adaptive testing may offer important advantages
to an organization involved in making classification (e.g., selection and
placement) decisions. Specifically, the data show that while a conventional test
classifies as well as an adaptive test when there is one cutting score at the
middle of the ability distribution, an adaptive test will provide better
categorization when there is more than one. The determination of the cost effectiveness
of adaptive testing in an organization, however, will depend on the utility
functions specified by the organization.


APPLICATIONS  OF  ITEM  CHARACTERISTIC  CURVE  THEORY 
TO  THE  PROBLEM  OF  TEST  BIAS 

STEVEN  M.  PINE 

University  of  Minnesota 


One of the most challenging and important issues facing test developers and
users today is whether or not ability tests are biased against minority groups, and
if so, how test bias can be reduced. In recent years, there has been considerable
research activity concerned with the identification and reduction of bias and
unfairness in various settings. For the most part, these efforts have been
unsuccessful. One possible reason for this lack of progress is the fact that almost
all the research on test bias and fairness has been based on classical test theory.

In his recent review of test theory, Lumsden (1976) refers to the true score
model of classical test theory as the "Model-T Theory" and suggests that classical
test theory reflects a very restricted range of test behavior. For example,
classical test theory emphasizes group-oriented measurement; but group-oriented
measurement is likely to be unproductive if tests are to be relevant to individuals of
varied backgrounds. Consequently, it is unlikely that this approach will be useful
in resolving problems as complex as those involved in test bias.

Bias in testing is caused by the failure of tests to take into account a
number of important variables in their construction, administration, and scoring
(Angoff, 1975; Green, 1976; Pine & Weiss, 1976; Sattler, 1974). These variables
include individual differences in motivation, ethnic background and related
variables.

Tests based on classical test theory may ignore certain types of individual
differences because they are constructed using item statistics which can be expected
to vary between population subgroups, and because they require all testees to take
identical test items. If progress is to be made in this critical research area, a
test theory that permits the testing process to be adapted or tailored to
individuals is needed. This capability now exists in the form of item characteristic
curve theory, coupled with the technology of adaptive test administration.


An  Item  Response  Model  of  Bias 


Item characteristic curve theory. Recently, a new test theory called "item
characteristic curve (or latent trait) theory," specifically designed for the
measurement of individuals, has emerged. Item characteristic curve theory (Lord &
Novick, 1968) is based on the idea that the responses which individuals make to a
given ability test item are determined by their ability on one or more underlying
dimensions (latent traits), and the parameters of the test items, i.e., their
difficulty, discriminating power, and probability of being guessed correctly by
chance. This idea is expressed mathematically by the Item Characteristic Curve (ICC)
which gives the probability that a testee with a given ability level on the
underlying dimension will correctly answer a given test item.


This research is supported by contract N00014-76-C-0244, NR No. 150-383, with the
Personnel and Training Research Programs, Office of Naval Research.

The ICC curves and their associated item parameters are the building blocks
of this new test theory. Once item parameters are determined for each test item,
they can be used to describe how individuals at a given ability level are likely
to perform on each item. ICC theory allows probabilistic statements to be made
about the ability level of testees regardless of their subgroup membership or which
subset of items they have been administered. This property provides a means for
creating tests which can be adapted to individual testees since it is no longer
necessary that identical items be administered to every testee, thus making ICC
theory potentially valuable for developing less biased tests. Furthermore, the
bias-reducing potential of ICC theory is not tied to its use with any particular
testing strategy, although the greatest benefits can be expected when it is used
in conjunction with adaptive testing (Pine & Weiss, 1977; Weiss, 1974).

Definition of item bias. A test item can be considered to be unbiased if all
individuals having the same underlying ability level have an equal probability of
correctly answering the item, regardless of their subgroup membership.

As indicated, the ICC gives the probability of correctly answering an item at
a given ability level. Therefore, the above definition of an unbiased item is
equivalent to requiring that a test item have the same ICC for all subgroups.
Since an ICC is described by its difficulty, discrimination and guessing parameters,
this is also equivalent to requiring that the values of these parameters be
invariant within a linear transformation from subgroup to subgroup. The linear
transformation assumption is necessary to account for the fact that subgroups in which
the parameters are calculated may have ability distributions with different means
and variances.
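The phrase "invariant within a linear transformation" can be made concrete. If one subgroup's ability scale is a linear rescaling of another's, θ* = kθ + m with k > 0, then an unbiased item's parameters transform as a* = a/k, b* = kb + m, and c* = c, and the ICC is unchanged. The sketch below assumes a normal ogive ICC with a guessing parameter; the model form, function names, and numerical values are illustrative assumptions, not taken from the paper.

```python
from statistics import NormalDist

PHI = NormalDist().cdf


def icc(theta, a, b, c):
    """Three-parameter normal ogive ICC (assumed form): probability of a
    correct answer given ability theta, discrimination a, difficulty b,
    and guessing parameter c."""
    return c + (1.0 - c) * PHI(a * (theta - b))


def rescale_item(a, b, c, k, m):
    """Item parameters re-expressed on the scale theta* = k * theta + m."""
    return a / k, k * b + m, c


# Hypothetical item and rescaling constants.
a, b, c = 1.2, -0.5, 0.2
k, m = 1.5, 0.4

theta = 0.3
p_original = icc(theta, a, b, c)
p_rescaled = icc(k * theta + m, *rescale_item(a, b, c, k, m))
print(p_original, p_rescaled)
```

The two probabilities agree, so parameters that differ between subgroups only by such a rescaling describe the same ICC; departures from the linear relation are what signal bias.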

Applying  the  Model  to  Detect  Test  Bias 

The  following  discussion  is  restricted  to  tests  that  consist  entirely  of 
homogeneous  items.  Homogeneity  implies  that  the  items  measure  essentially  one 
ability  dimension.  This  definition  allows  for  the  possibility  that  a homogeneous 
set  of  items  may  measure  one  or  more  extraneous  dimensions  in  addition  to  the  single 
primary  dimension  which  the  test  is  purported  to  measure.  For  instance,  test  items 
intended  to  measure  vocabulary  ability  may  inadvertently  also  measure  several 
cultural variables. Although the present discussion is restricted to homogeneous
items, the concepts developed here could in principle be extended to the
multidimensional case.

It  is  also  assumed  here  that  test  items  fit  an  underlying  response  model  for 
all  subgroups.  This  model  is  the  function  which  specifies  the  shape  of  the  ICC 
curve  and  indicates,  at  each  ability  level,  the  probability  that  an  individual 
at  that  level  will  correctly  answer  the  administered  item.  This  constraint  is  not 
as  limiting  as  it  may  appear  to  be,  since  one  can  empirically  test  the  fit  of  the 
item  data  to  the  assumed  response  model  and  eliminate  those  items  that  do  not  fit 
prior  to  carrying  out  any  of  the  analyses  described  here. 

Given the above restrictions, the first step in investigating whether a set
of items is biased is to screen out those items which do not fit the underlying
response model. Most of the existing computer programs for estimating item response
parameters (e.g., Urry, 1974a; Wingersky & Lord, 1973) reject items that do not
fit the assumed model as a matter of course. Therefore, with these programs, it
can be assumed that all items for which parameter values are available fit the
response model.

The next step is to demonstrate that these items are homogeneous, i.e., the
same trait accounts for the major portion of underlying variance in each subgroup's
inter-item correlation matrix. If they are homogeneous, Lord and Novick (1968,
pp. 359-360) have shown that their item response parameters will be invariant
(within a linear transformation) across subgroups. According to the definitions
given earlier, invariant test items are unbiased. Therefore, a sufficient method
for demonstrating that a set of test items is unbiased is first to factor analyze
the matrix of inter-item correlation coefficients within each of two or more
subgroups and demonstrate that the same single factor accounts for the major portion
of variance in each subgroup's matrix, and then show that this is the factor that
the test was intended to measure.


Figure 23

Item Bias Shown as a Perpendicular Distance
in a Scatter Plot of Subgroup Item Difficulties

(Axis label: Item Difficulty Parameters, Majority Subgroup)

A second approach for determining whether a set of test items is biased is
also implicit in the work of Lord and Novick. If the same dimension underlies a set
of test items for a population of testees (which would, therefore, make the items
unbiased), the item parameters for any two subgroups in the population should have
a linear relationship (Lord & Novick, 1968, p. 380). This condition can be tested
directly by plotting the discrimination (a), difficulty (b), or guessing (c)
parameters of a set of items derived from one subgroup against those from another
and testing for linearity. A plot of this type, based on the item response
difficulty parameters for a 10-item test, is shown in Figure 23. If factor analysis
indicates that a single dimension underlies a set of items, the presence of a linear
relation between subgroups for ICC parameters is both a necessary and sufficient
demonstration that these items are unbiased.

In  Figure  23,  the  perpendicular  distance  between  each  item  and  the  best 
fitting  line  through  all  the  points  can  be  interpreted  as  the  degree  of  item  bias; 
the  greater  the  distance,  the  more  item  bias  is  implied.  By  comparing  the  relative 
item  parameter  values  between  subgroups,  it  is  possible  to  identify  the  specific 
test  items  which  contribute  the  most  to  a non-linear  relationship  between  subgroup 
parameters.  In  the  language  of  analysis  of  variance,  this  non-linear  relationship 
would  be  an  item-by-group  interaction.  Plots  similar  to  Figure  23  and  related 
interpretations  could  also  be  made  for  item  discrimination  and  guessing  parameters. 
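The degree-of-item-bias index can be sketched as follows. The paper does not say how the "best fitting line" is obtained; because the index is a perpendicular distance, this sketch assumes an orthogonal (major axis) fit through the centroid of the points rather than an ordinary least squares regression. The function name and the difficulty values are hypothetical.

```python
import math


def perpendicular_distances(x, y):
    """Perpendicular distance of each (x, y) point from the major axis line,
    i.e., the line through the centroid that minimizes the sum of squared
    perpendicular distances (an assumed choice of best fitting line)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # Orientation of the major axis of the point cloud.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    # Unit vector perpendicular to that axis.
    nx, ny = -math.sin(angle), math.cos(angle)
    return [abs((xi - mx) * nx + (yi - my) * ny) for xi, yi in zip(x, y)]


# Hypothetical difficulty parameters for five items in two subgroups.
majority = [0.0, 1.0, 2.0, 3.0, 4.0]
minority = [0.0, 1.0, 3.0, 3.0, 4.0]
distances = perpendicular_distances(majority, minority)
print([round(d, 3) for d in distances])
```

In this made-up example the third item lies farthest from the line and would be the first candidate for screening.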

The degree-of-item-bias index illustrated in Figure 23 has several
applications. It could be used to screen out the most biased items during the
construction of a conventional test. Or, it could be used within an adaptive testing
framework as an additional criterion for item selection.

The  assessment  of  item  bias  by  plotting  a scatter  diagram  of  item  parameters 
for  one  subgroup  against  another  is  not  in  itself  new.  A very  similar  method  has 
been  used  at  Educational  Testing  Service  (ETS)  for  several  years.  The  essential 
difference  between  the  present  method  and  the  ETS  method  is  that  ETS  uses  item 
parameters  based  on  classical  test  theory.  It  can  be  shown  (Lord  & Novick,  1968, 
p.  301)  that  classical  item  parameters  will  generally  not  be  linearly  related  across 
subgroups  of  a population.  This  means  that  the  test  for  bias  using  classical 
parameters can lead to an artifactual detection of bias. Furthermore, the difficulty
parameter of classical test theory is confounded by level of discrimination
and  guessing  effects  (Urry,  1974b).  Thus,  if  an  item  is  relatively  more  difficult 
for  one  subgroup  than  another,  it  is  not  clear  whether  this  is  because  the  item 
varies  only  on  difficulty,  or  whether  this  result  is  caused  by  differences  in 
discrimination  and/or  guessing.  The  item  parameters  from  ICC  theory,  on  the  other 
hand,  provide  relatively  unconfounded  measures  of  difficulty,  discrimination,  and 
guessing.  Therefore,  by  plotting  these  parameters  on  separate  graphs,  it  is 
possible  to  determine  exactly  why  an  item  is  biased.  For  instance,  it  may  be  that 
a given  item  is  biased  not  because  it  is  relatively  more  difficult  for  a minority 
subgroup,  but  because  that  subgroup  is  less  effective  at  guessing.  This  kind  of 
detailed  analysis  is  impossible  using  classical  item  parameters. 
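
The non-linearity of classical item parameters across subgroups can be illustrated numerically. The sketch below rests on assumptions not stated in the text: a two-parameter normal ogive with no guessing, ability normally distributed within each subgroup, and made-up item and group values. Under those assumptions the classical difficulty (proportion correct) of an item is Phi(a(mu - b) / sqrt(1 + a^2 sigma^2)).

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def classical_p(a, b, mu, sigma):
    """Expected proportion correct for a normal-ogive item with
    discrimination a and difficulty b when ability is N(mu, sigma^2)."""
    return normal_cdf(a * (mu - b) / math.sqrt(1.0 + (a * sigma) ** 2))

def pearson_r(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical items: the SAME ICC parameters hold in both subgroups,
# so the ICC difficulties are perfectly linearly related by construction.
items = [(1.0, k / 2.0) for k in range(-4, 5)]  # (a, b), b from -2 to 2

# Two hypothetical subgroups differing only in mean ability.
p_high = [classical_p(a, b, mu=0.0, sigma=1.0) for a, b in items]
p_low = [classical_p(a, b, mu=-1.0, sigma=1.0) for a, b in items]

r_classical = pearson_r(p_high, p_low)
print(f"Correlation of classical difficulties: {r_classical:.3f}")
```

Even though no item in this example is biased, the classical difficulties in the two subgroups fall along a curve rather than a straight line, which is the artifact the text describes.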

Another  interesting  consideration  in  the  use  of  ICC  versus  classical  item 
parameters  is  the  fact  that  if  classical  item  parameters  are  linearly  related  among 
subgroups,  thereby  implying  an  unbiased  set  of  items,  ICC  parameters  will  of 
necessity  not  be  linearly  related  and  will,  therefore,  imply  the  presence  of  bias 
in  these  same  items.  This  fact  would  seem  to  have  particular  relevance  for  the 
work of researchers such as Jensen (1975) who have concluded that tests are generally
not biased against Blacks based on the presence of a linear relationship
between classical item parameters correlated across Black and White subgroups.

An example with real data. To demonstrate how these analyses might be used
and interpreted, they have been applied to the difficulty parameter from 75
multiple-choice vocabulary items administered in a racially mixed high school in
Minneapolis. The sample sizes in this study were not optimal (58 Blacks,
168 Whites), but the data provide a good example of the technique.

First, the homogeneity assumption was tested by factor analyzing the inter-item
correlation matrices. A subset of 45 items was chosen and two tetrachoric
intercorrelation matrices were calculated, one for the Black and one for the White
subsamples. The matrices were then factor analyzed using the principal axis method;
communalities were estimated using the highest off-diagonal entry for each item,
and the factor solution was iterated until the estimated communalities stabilized.
Eight  factors  were  extracted  from  each  matrix,  in  each  case  accounting  for  all  of 
the  estimated  common  variance.  The  eigenvalues  from  the  two  factor  analyses  are 
shown  in  Table  2. 


Table 2

Eigenvalues from Factor Analyses of Black and White
Subgroup Item-Intercorrelation Matrices

                                   Percent of
                                   Common       Cumulative
Subgroup   Factor   Eigenvalue     Variance     Percent

Whites       1        19.26          64.8          64.8
             2         2.32           7.8          72.7
             3         1.67           5.6          78.3
             4         1.58           5.3          83.7
             5         1.37           4.6          88.3
             6         1.20           4.1          92.4
             7         1.18           4.0          96.4
             8         1.08           3.6         100.0

Blacks       1        16.33          47.9          47.9
             2         3.70          10.9          58.7
             3         3.01           8.8          67.5
             4         2.64           7.7          75.3
             5         2.35           6.9          82.2
             6         2.26           6.6          88.8
             7         2.06           6.0          94.9
             8         1.75           5.1         100.0

For  both  the  Black  and  the  White  data,  the  first  eigenvalue  was  very  large  in 
comparison to the remaining eigenvalues, providing evidence supportive of the unidimensionality
assumption. Furthermore, the items appear to be measuring the same
dimension  in  both  subgroups,  since  the  coefficient  of  congruence  (Rummel,  1970, 
p.  461)  calculated  between  the  45  corresponding  loadings  for  Factor  1 in  the  two 
subgroups  was  .97.  It  also  seems  reasonable  to  conclude,  based  on  the  pattern  of 
loadings,  that  Factor  1 is  measuring  vocabulary  ability. 
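
For readers unfamiliar with the coefficient of congruence, the following sketch shows how such an index is commonly computed: the sum of cross-products of corresponding loadings divided by the square root of the product of their sums of squares. The loadings are hypothetical; the study's actual Factor 1 loadings are not reproduced in this report.

```python
import math

def congruence(loadings_x, loadings_y):
    """Coefficient of congruence between two columns of factor loadings."""
    cross = sum(x * y for x, y in zip(loadings_x, loadings_y))
    norm = math.sqrt(sum(x * x for x in loadings_x) *
                     sum(y * y for y in loadings_y))
    return cross / norm

# Hypothetical first-factor loadings for the same five items in two subgroups.
group_1 = [0.62, 0.55, 0.71, 0.48, 0.66]
group_2 = [0.58, 0.49, 0.75, 0.40, 0.61]

print(f"Coefficient of congruence: {congruence(group_1, group_2):.2f}")
```

Unlike a Pearson correlation, this index does not center the loadings, so it reflects agreement in the size of the loadings as well as in their pattern.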


The  results  of  a further  analysis  of  bias  for  these  75  items  are  shown  in 
Figure  24.  The  scatter  plot  in  Figure  24  is  based  on  the  estimated  ICC  difficulty 
parameter  values  calculated  separately  for  the  White  and  Black  subsamples. 


Figure 24

Graphical Analysis of the Bias in 75 Multiple-Choice
Vocabulary Items

[Scatter plot not reproduced. Recoverable axis label: Difficulties for Blacks (b).]


The  data  plotted  in  Figure  24  show  that  almost  all  of  the  items  are  relatively 
more  difficult  for  Blacks  than  for  Whites.  This  is  indicated  by  the  fact  that  the 
dots  representing  the  items  tend  to  fall  below  the  diagonal  line.  If  the  items  were 
equally  difficult  for  Blacks  and  Whites,  the  data  points  would  fall  on  this  line. 

However,  the  mere  fact  that  the  items  are  relatively  more  difficult  for  Blacks 
cannot  necessarily  be  taken  as  an  indication  of  bias,  since  bias  in  the  test  items 
is assessed by evaluating the degree of linearity in the plot. The Pearson product-moment
correlation coefficient between the item parameter values for Blacks and
Whites is r = .86, indicating a high degree of linear relationship. This is consistent
with the results of the factor analysis and suggests that these vocabulary
items,  when  taken  as  a group,  are  essentially  unbiased.  It  is  possible,  however, 
that  even  though  the  items  taken  as  a group  are  unbiased,  one  or  more  of  the  items 
taken  individually  might  be  biased.  For  instance,  in  these  data,  several  items 
appear  to  have  larger  departures  from  the  dotted  line  fitted  through  the  item  points 
in  Figure  24.  Of  course,  it  is  possible  that  these  large  departures  may  be  due  only 
to  sampling  error.  To  eliminate  possible  misinterpretations  that  would  occur  if 
this  were  the  case,  a technique  is  under  development  to  establish  confidence  limits 
for  the  best  fitting  line.  This  technique  will  permit  the  identification,  with 
some  known  degree  of  confidence,  of  biased  items. 

Related  Developments 

The  material  presented  here  is  only  one  example  of  how  item  characteristic 
curve  theory  can  potentially  be  applied  to  the  problem  of  test  bias.  It  is  only 
a small  part  of  the  research  related  to  test  bias  and  unfairness  currently  underway 
at  the  University  of  Minnesota. 

Additional  developments  involve  a method  of  correcting  for  bias  in  the  ICC 
item  parameters.  Very  briefly,  this  method  consists  of  determining  item  parameter 
estimates  that  will  depend  only  on  the  extent  to  which  an  item  loads  on  the  factor 
it  is  supposed  to  be  measuring.  In  essence,  this  approach  is  based  on  the  notion 
that  to  obtain  unbiased  test  items,  all  that  is  necessary  is  to  know  how  each  test 
item  behaves  (i.e.,  what  its  parameters  are)  in  the  various  subgroups  which  comprise 
our  test  population.  Using  the  method  now  under  development,  bias  in  an  item  can 
be  eliminated  by  correcting  its  parameter  values  to  account  for  the  degree  of  bias. 
Then,  if  the  resulting  ability  estimates  are  based  not  on  the  total  number  of 
correct  answers,  but  on  some  function  of  the  corrected  item  parameter  values,  the 
resulting  ability  estimates  will  be  unbiased. 

This method for correcting item bias is now being studied by computer simulation
techniques. In this way, the bias-corrected item parameter values can be
directly  compared  to  the  known,  true  item  parameter  values.  If  the  results  of 
these  studies  are  favorable,  the  technique  will  permit  the  reduction  or  elimination 
of  the  effects  of  item  bias  on  ability  test  scores. 

Does this mean that we can now write the final chapter on test unfairness?
Not at all! First, some may disagree that bias has been eliminated as long as
differences  exist  in  the  mean  test  scores  of  various  subgroups.  Secondly,  bias 
in  the  estimation  of  item  parameters  is  only  one  source  of  possible  unfairness  in 
the  testing  process.  A test  can  be  unfair  for  a myriad  of  other  reasons,  including 
those  attributable  to  elements  in  the  testing  environment,  and  to  the  psychometric 
properties  of  the  procedure  used  to  select  and  administer  test  items  (Pine  & Weiss, 
1977;  Weiss,  1975).  To  explore  the  possible  psychometric  influences  on  test 
unfairness,  a series  of  computer  simulations  designed  to  investigate  how  item 
characteristics  interact  with  the  choice  of  a testing  strategy  is  currently  in 
progress.  Also  in  progress  is  a live  computerized  testing  study  designed  to 
investigate  how  well  some  of  the  bias-reducing  procedures  described  in  this  paper 
operate in a real test administration. This study will also investigate a computerized
adaptive test designed explicitly to reduce bias in test scores. In addition,
the study is designed to replicate a previous finding that computerized tests
increase the test-taking motivation of minority testees (Betz & Weiss, 1976b;
Weiss, 1976).

The purpose of achievement testing is to locate individuals on an achievement
scale.  Usually,  to  interpret  achievement  test  scores,  a transformation  is  applied 
to  the  scores  which  allows  an  interpretation  in  terms  of  the  relative  standing  of 
an  individual  with  respect  to  the  norming  group.  In  many  instructional  settings, 
this  interpretation  is  not  adequate  and,  as  a result,  instructional  personnel 
have  requested  more  concrete  kinds  of  interpretation.  Criterion-referenced 
testing,  mastery  testing  and  similar  approaches  have  been  developed  to  meet 
these  needs. 

What  is  unique  about  criterion-referenced  and  mastery  testing  is  that  the 
items  that  constitute  the  test  are  sampled  from  a population  of  items  which  is 
isomorphic  with  the  objectives  of  the  instructional  program  in  which  achievement 
is to be measured (Shoemaker, 1975). Because of this, it is possible to interpret
scores in terms of the specific areas of achievement that a student has
mastered  in  relation  to  the  objectives  of  the  instructional  program. 

Undoubtedly,  this  attention  to  content  is  bound  to  increase  the  quality 
of  achievement  test  scores.  However,  the  degree  of  improvement  possible  in 
achievement  test  scores  using  any  approach  to  achievement  test  construction  is 
limited  by  the  nature  of  the  test  item.  When  typical  multiple-choice  test 
items  are  used,  a very  limited  range  of  student  performance  is  measured.  The 
cognitive  skills  involved  appear  to  be  the  processes  of  recall  of  information 
coupled  with  recognition  of  the  correct  answer,  and  the  result  is  usually 
expressed as either "correct" or "incorrect". However, achievement or knowledge
is seldom all or none, and proceeding as if it were, as in the typical "correct-incorrect"
multiple-choice achievement test, does not extract all the
potential information about an individual's achievement level. This paper
describes research concerned with the integration of testing procedures which
take partial information into account with methods of computerized adaptive
achievement test administration, and discusses some implications of this research
for performance testing.

Partial  Knowledge 

Background.  Intuitively  it  seems  clear  that  extracting  partial  knowledge 
from  test  responses  should  lead  to  better  assessment  of  achievement.  However, 
the research literature (e.g., Wang & Stanley, 1970) does not show consistent
increases in both reliability and validity when partial knowledge is taken
into account. The results of the typical investigation (e.g., Hakstian & Kansup,
1975) show that, while reliability is usually increased by taking partial knowledge
into account, the validity of the scores remains the same or even diminishes.
Such findings are usually interpreted as evidence against the usefulness
of the assessment of partial knowledge. However, a careful consideration
of  the  problem  suggests  that  something  is  amiss.  One  possible  explanation  is 
that  the  test  and  the  criterion  are  not  unidimensional. 

To illustrate, consider two tests, A and B, measuring a single construct.
Test B can be referred to as the "criterion test"; the correlation between A
and B will be referred to as the validity of Test A. Both Test A and Test B
correlate .60 with the construct. This can be summarized as follows:

\[
\Lambda = \begin{bmatrix} .60 \\ .60 \end{bmatrix}
\quad \text{(rows: Test A, Test B)} \tag{13}
\]

Then the intertest correlation matrix can be expressed (Jöreskog, 1971; Maxwell,
1971) as Equation 14.

\[
\Sigma = \Lambda\Lambda' + \Psi^2, \tag{14}
\]

where $\Psi^2$ is a diagonal matrix of error variances. For the $\Lambda$ in Equation 13,
Equation 14 becomes

\[
\Sigma = \begin{bmatrix} .36 & .36 \\ .36 & .36 \end{bmatrix} + \Psi^2. \tag{15}
\]

The off-diagonal element of $\Lambda\Lambda'$ is equal to the validity of A and the
diagonal elements are reliabilities. In this case both A and B have reliabilities
of .36 and the validity of Test A is .36.


Now, suppose Test A is administered under conditions that allow for partial
knowledge and that, as a result, its correlation with the construct goes
from .60 to .70. Following the same procedure shown in Equation 15, the reliability
of Test A becomes .49 while that of Test B remains at .36. At the
same time, the validity of Test A increases from .36 to .42. In short, when
there is a single common factor underlying the responses to a criterion and a
predictor, an increase in the reliability of the predictor will lead to an
increase in its validity. This is not so when more than one factor is common.

To illustrate this, assume that Tests A and B, both administered conventionally,
have in common a method factor ($\theta_m$), in addition to the construct,
and that both correlate .40 with it. That is,

\[
\Lambda = \begin{bmatrix} .60 & .40 \\ .60 & .40 \end{bmatrix}
\quad \text{(rows: Test A, Test B; columns: construct, method factor)} \tag{16}
\]

Assuming that the construct and the method factor are uncorrelated, the
correlation matrix for Tests A and B, according to the model in Equation 14,
is given by:

\[
\Sigma = \begin{bmatrix} .52 & .52 \\ .52 & .52 \end{bmatrix} + \Psi^2. \tag{17}
\]

In this case, the validity of Test A is .52.


Now, suppose that the same Test A is again administered under conditions
that allow for the scoring of partial information and that, as a result of
this, its correlation with the construct becomes .70. At the same time the
correlation of Test A with the method factor drops from .40 to .20; i.e., $\Lambda$
becomes:

\[
\Lambda = \begin{bmatrix} .70 & .20 \\ .60 & .40 \end{bmatrix} \tag{18}
\]

and

\[
\Sigma = \begin{bmatrix} .53 & .50 \\ .50 & .52 \end{bmatrix} + \Psi^2. \tag{19}
\]

Thus the validity of Test A decreases from .52 to .50. However, it is clear
that this seemingly disappointing result is not inconsistent with the true
improvement that occurred, namely an
increase in the correlation of Test A with the construct.
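
The reliability and validity figures quoted in these examples can be checked with a few lines of arithmetic. The sketch below simply forms the common part of the correlation matrix (the loading matrix multiplied by its transpose) for each set of loadings given in the text; the function and variable names are illustrative only.

```python
def common_part(loadings):
    """Return the loading matrix times its transpose, where `loadings`
    holds one row of factor loadings per test."""
    return [[sum(p * q for p, q in zip(row_i, row_j)) for row_j in loadings]
            for row_i in loadings]

# Rows are Test A and Test B; columns are (construct) or (construct, method).
cases = {
    "one factor, conventional": [[0.60], [0.60]],
    "one factor, partial knowledge": [[0.70], [0.60]],
    "two factors, conventional": [[0.60, 0.40], [0.60, 0.40]],
    "two factors, partial knowledge": [[0.70, 0.20], [0.60, 0.40]],
}

results = {}
for label, loadings in cases.items():
    m = common_part(loadings)
    # Diagonal entries are reliabilities; the off-diagonal entry is
    # the validity of Test A against the criterion Test B.
    results[label] = (round(m[0][0], 2), round(m[1][1], 2), round(m[0][1], 2))
    print(label, "-> reliability A, reliability B, validity:", results[label])
```

With a single factor, raising the construct loading of Test A raises both its reliability and its validity; with a shared method factor, the reliability still rises slightly while the validity falls, which is the pattern the example is meant to show.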


Although  this  example  contains  many  assumptions,  it  seems  that  something 
similar  occurs  with  real  data.  Hakstian  and  Kansup  (1975)  compared  the  validity 
of  a verbal  ability  test  administered  under  conventional  and  elimination  scoring 
(Coombs, Millholland, & Womer, 1956) instructions. Validity was defined as the
correlation with school grades in language arts. This correlation was .49 under
conventional administration and .39 under elimination scoring. However, the
correlation with another verbal ability test was .59 under conventional scoring
and .67 under elimination scoring. Thus, when validity is defined as the correlation
with school grades, elimination scoring appears to be less valid;
but when validity is defined as the correlation with another verbal ability
score, elimination scoring is more valid. These results are not contradictory
but simply provide evidence of the fact that performance on verbal ability
tests measured either with multiple-choice or elimination items is explained
by the same ability, whereas school grades in language arts do not depend exclusively
on verbal ability.

Advantages  of  using  partial  information.  If  methods  for  the  assessment 
of  partial  knowledge  are  to  yield  improved  test  scores,  the  tests  must  be 
such  that  there  will  be  an  opportunity  for  partial  knowledge  to  emerge.  With 
few  exceptions,  most  notably  Coombs  et  al.  (1956),  the  presence  of  partial 
knowledge  is  never  tested.  Some  theoretical  results  suggest  that  when  partial 
knowledge  is  allowed  to  emerge  and  is  scored,  dramatic  improvements  in  test 
scores  follow. 

To  illustrate  this,  consider  the  information  functions  of  two  latent  trait 
models.  Information  at  a given  point  on  the  underlying  trait  is  the  reciprocal 
of  the  variance  of  the  maximum  likelihood  estimator  at  that  point.  Therefore, 
the larger the information value, the more precise is the estimate of the location
of an individual on the trait. One latent trait model studied was the
two-parameter normal ogive (Lord & Novick, 1968, Chap. 16) which is appropriate
for dichotomous scoring. The other model was Samejima's (1969) graded
response model, which is an extension of the two-parameter normal ogive model to
polychotomous scoring. Information levels of the graded model can be considered
to be the case when partial knowledge is taken into account, whereas the information
provided by the dichotomous model is that provided when partial information
is ignored.
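
To make the comparison concrete, the following sketch computes item information for a single normal ogive item scored dichotomously and for the same kind of item scored in graded categories. It is an illustration under stated assumptions, not the computation reported here: it uses the usual expressions (squared slope of each category probability divided by that probability, summed over categories), and the discrimination and category boundaries are hypothetical.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def dichotomous_information(theta, a, b):
    """Item information for a two-parameter normal ogive item."""
    z = a * (theta - b)
    p = normal_cdf(z)
    return (a * normal_pdf(z)) ** 2 / (p * (1.0 - p))

def graded_information(theta, a, boundaries):
    """Item information for a graded item whose ordered category
    boundaries each follow a normal ogive with common discrimination a."""
    # Probability of responding at or above each boundary, padded with
    # 1 (lowest category or above) and 0 (above the highest category).
    cum = [1.0] + [normal_cdf(a * (theta - b)) for b in boundaries] + [0.0]
    slope = [0.0] + [a * normal_pdf(a * (theta - b)) for b in boundaries] + [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]        # probability of category k
        dp_k = slope[k] - slope[k + 1]   # its derivative with respect to theta
        if p_k > 0.0:
            info += dp_k ** 2 / p_k
    return info

# Hypothetical five-category item and its dichotomized version.
a = 1.0
boundaries = [-1.5, -0.5, 0.5, 1.5]
theta = 0.0
info_graded = graded_information(theta, a, boundaries)
info_dichotomous = dichotomous_information(theta, a, b=0.5)
print(f"graded: {info_graded:.3f}  dichotomous: {info_dichotomous:.3f}")
print(f"standard error from graded item alone: {1 / math.sqrt(info_graded):.3f}")
```

Collapsing the graded categories into two can only discard information, so the graded value is never smaller than the dichotomous one at any trait level.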

To simplify the comparison, the mean information for each model was computed,
assuming that the underlying trait was normally distributed. In addition,
it was assumed that each test consisted of 60 items, each having the
same item-trait correlation (r). The distribution of item difficulty in the
dichotomous case can be described as a truncated normal distribution with a mean
of 0.0 and maximum and minimum equal to 1/r and -1/r, respectively. The distribution
of difficulty of the highest category in the graded model was also a
truncated normal distribution but with a mean of .40/r and maximum and minimum
1/r and -.20/r. Within each graded item, the difficulty of each of the lower
categories was set in such a way that the categories would be chosen by the
same proportion of testees. The comparison assumes that there are five graded-response
categories. This choice of difficulties approaches the optimal conditions
for the two models.

The ratio of the mean information for the graded model over that of the
dichotomous model for several levels of test homogeneity is seen in Table 3.
For example, at an item-trait correlation of r = .55 the ratio was 1.42. This
means that, on the average, the use of partial knowledge will be 42% more
informative than if it is ignored. Note that this improvement, due to
incorporating partial information into the scores, increased as the discrimination
of the test increased. In other words, the better the test, the more
it will benefit from adding partial knowledge. This is also true when reliability
rather than information is used as the evaluative criterion (Bejar & Weiss, in
press).


Table 3

Ratio of Mean Information of Graded to
Dichotomous Model, as a Function of Item-Trait Correlation

Item-Trait correlation        .55    .63    .71    .77    .84    .95
Ratio of mean information    1.42   1.43   1.48   1.52   1.58   1.90
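
As a reading aid for Table 3, the sketch below converts each ratio into a percentage gain and each item-trait correlation into the corresponding normal ogive discrimination. The conversion a = r / sqrt(1 - r^2) is the standard relation for the normal ogive model when r is the biserial correlation of the item with the trait; treating the tabled correlations that way is an assumption made here for illustration.

```python
import math

def discrimination_from_r(r):
    """Normal ogive discrimination implied by an item-trait correlation r."""
    return r / math.sqrt(1.0 - r * r)

# Values copied from Table 3.
item_trait_r = [0.55, 0.63, 0.71, 0.77, 0.84, 0.95]
info_ratio = [1.42, 1.43, 1.48, 1.52, 1.58, 1.90]

percent_gain = [round((ratio - 1.0) * 100.0) for ratio in info_ratio]

for r, ratio, gain in zip(item_trait_r, info_ratio, percent_gain):
    a = discrimination_from_r(r)
    print(f"r = {r:.2f}  a = {a:.2f}  ratio = {ratio:.2f}  gain = {gain}%")
```

The discrimination grows quickly as r approaches 1, which helps explain why the largest gain in the table occurs at r = .95.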

The  advantages  derived  from  taking  partial  knowledge  into  account  can 
only  materialize  under  the  proper  conditions.  In  the  typical  multiple-choice 
test  item,  even  though  partial  knowledge  influences  which  alternative  is 
chosen,  the  response  is  scored  as  correct  or  incorrect.  One  way  of  allowing 
credit  to  be  given  for  partial  knowledge  is  to  instruct  testees  to  segregate 
alternatives into different categories. Coombs' (1956) procedure is an instance
of the approach where the categories are "correct" and "incorrect".
Other  categories  are  possible,  though;  e.g.,  verbal  items  may  be  classified 
as  "synonyms",  "antonyms",  or  "neither". 

Computerized  Testing 

Recording  and  scoring  responses  to  non-dichotomous  test  items  is  not, 
however,  convenient  with  paper-and-pencil  test  administration.  One  obvious 
use  of  interactive  computers,  therefore,  is  to  handle  the  recording  and 
scoring  of  responses  to  non-dichotomous  achievement  test  items.  But,  as 
previous  presentations  in  this  report  suggest,  the  computer  can  also  be  used 
to  adapt  or  tailor  the  test  to  each  individual. 

These  presentations  (and  indeed  most  of  the  research  in  computerized 
adaptive  testing)  have  been  oriented  toward  ability  measurement.  In 
achievement  testing,  it  is  possible  to  distinguish  between  two  kinds  of 
adaptive test administration: one involves adapting the length of the test;
in the other, the difficulty of the test is adapted.

Adapting  the  length  of  the  test  to  the  individual  is  appropriate  in 
instructional  settings  where  each  individual  is  allowed  as  much  time  as  is 
necessary  to  complete  a given  unit  of  instruction.  Under  those  conditions, 
individual  differences  with  respect  to  knowledge  are  minimized  and  it  becomes 
profitable  to  adapt  the  length  of  the  test  rather  than  its  difficulty.  The 
research of Ferguson (1970) is an example of this type of adaptive testing.
In his system, an individual is tested until he is classified into a non-mastery
or mastery category. The statistical basis of this system is Wald's
sequential likelihood ratio test. Ferguson's model assumes that the difficulty
and discrimination of all items are the same. It is not known how
sensitive the procedure is with respect to violation of these assumptions.
Thus, research addressed to this question is needed. It would also be
desirable  to  study  the  possibility  of  relaxing  the  model  to  allow  for  unequal 
item  difficulties  and  discriminations  as  well  as  allowing  for  polychotomous 
responses. 
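
A variable-length mastery test of this kind can be sketched with Wald's sequential test applied to a run of right/wrong responses. The code below is a generic illustration, not Ferguson's implementation: it assumes every item has the same probability of a correct answer for a given examinee, and the mastery and non-mastery probabilities and error rates are hypothetical values.

```python
import math

def sprt_mastery(responses, p_nonmaster=0.5, p_master=0.8,
                 alpha=0.05, beta=0.05):
    """Classify an examinee with Wald's sequential test.

    responses   -- iterable of 1 (correct) / 0 (incorrect), in order given
    p_nonmaster -- assumed probability correct for a non-master
    p_master    -- assumed probability correct for a master
    alpha       -- tolerated rate of calling a non-master a master
    beta        -- tolerated rate of calling a master a non-master
    Returns (decision, number_of_items_used).
    """
    upper = math.log((1.0 - beta) / alpha)   # cross this: decide mastery
    lower = math.log(beta / (1.0 - alpha))   # cross this: decide non-mastery
    log_ratio = 0.0
    n_used = 0
    for correct in responses:
        n_used += 1
        if correct:
            log_ratio += math.log(p_master / p_nonmaster)
        else:
            log_ratio += math.log((1.0 - p_master) / (1.0 - p_nonmaster))
        if log_ratio >= upper:
            return "mastery", n_used
        if log_ratio <= lower:
            return "non-mastery", n_used
    return "undecided", n_used

print(sprt_mastery([1, 1, 0, 1, 1, 1, 1, 1, 1, 1]))
print(sprt_mastery([0, 0, 1, 0, 0, 0]))
```

Testing stops as soon as the accumulated evidence crosses either boundary, so consistent examinees receive short tests and borderline examinees receive longer ones. Allowing unequal item parameters, as suggested above, would mean replacing the two fixed probabilities with item-specific values.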

Although  self-paced  instruction  has  many  advantages,  limited  resources 
often  do  not  permit  its  full  implementation.  As  a result,  the  sample  under 
instruction will likely be heterogeneous with respect to achievement. Similarly,
if a test is intended to measure retention of achievement or levels
of  achievement  acquired  prior  to  instruction,  there  will  be  wide  variation  in 
levels  of  performance.  Under  these  conditions,  adapting  the  test  to  an 
individual's  level  of  achievement  will  be  more  efficient  than  the  conventional 
non-adaptive  procedure. 

Most  of  the  research  on  adaptive  testing  has  been  done  in  the  context 
of  dichotomous  response  models.  The  exceptions  are  to  be  found  in  the  work 
of  Bayroff,  Thomas,  and  Anderson  (1960),  Wood  (1971),  and  Samejima  (1976). 
One of the major aims of the achievement/performance testing research at
the University of Minnesota is to combine the advantages of partial knowledge
scoring and adaptive testing. Bayroff et al. (1960) seem to be the only
researchers who have actually implemented an adaptive testing strategy using
non-dichotomous items. Essentially what they did was to branch an individual
according to the correctness of the alternative chosen. Although they used
a polychotomous item for the first item only, this can be readily extended
to include all items. Other branching rules are possible. Wood (1971) suggested
that the optimal branching rule will administer as the next item the
most discriminating of those items with a midpoint of adjacent categories
closest to the individual's current estimated achievement. Samejima (1976)
implemented a simulation on live data of a similar procedure, which she
referred to as tailoring the dichotomization of the item to the individual.
She noted substantial improvements by comparing the plot of scores based on
a uniform dichotomization and tailored dichotomization against the scores
based on the polychotomous responses.

Summary  and  Conclusions 

Two  recent  developments  in  test  theory  hold  promise  for  the  improvement 
of achievement test scores. In combination, adapting the test to the individual
and simultaneously extracting more information from each response by
recording  partial  knowledge  should  result  in  greater  improvements  in  achievement 
test  scores  than  either  taken  alone.  The  use  of  non-dichotomous  item  formats, 
now  made  possible  by  the  administration  of  achievement  test  items  on  interactive 
computers,  should  result  in  achievement  tests  which  more  accurately  measure 
what  a student  has  learned  as  a result  of  instruction. 

Although  the  use  of  polychotomous  models  in  the  measurement  of  partial 
knowledge  has  been  emphasized  here,  it  is  clear  that  these  models  have  much 
to  offer  in  performance  testing  as  well.  Fitzpatrick  and  Morrison  (1970) 
define  a performance  test  as  "one  in  which  some  criterion  situation  is 
simulated to a much greater degree than represented by the usual paper-and-pencil
test." Unlike paper-and-pencil tests, performance tests are relatively
expensive and it is this cost consideration that highlights the necessity
for extracting as much information as possible from a testee's set of responses.
Polychotomous response models make this feasible. The use of
interactive  computers  also  has  much  to  offer  in  the  area  of  performance  testing, 
for  computerized  test  administration  can  make  it  possible  to  represent  simulated 
situations  conveniently  and  economically.  Additional  savings  are  likely  by 
testing  individuals  only  on  those  skills  which  match  the  individual's  level 
of  training. 

In  short,  it  seems  that  coupling  polychotomous  response  model  theory  with 
interactive  computer  administration  of  tests  is  likely  to  result  in  more 
accurate  and,  in  the  long  run,  more  economical  assessments  of  achievement  and 
performance.